Python for Data Science

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
%config Completer.use_jedi = False # faster auto complete
#plot
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# set style
plt.style.use('classic')
%matplotlib inline 

Notebook Environment

In [248]:
#get help on python command
print(help(len))
len?
''' or insert[Shift]+[Tab]'''
Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.

None
Signature: len(obj, /)
Docstring: Return the number of items in a container.
Type:      builtin_function_or_method
In [17]:
#also work on methods
A = [1,2,3]
A.insert?
Signature: A.insert(index, object, /)
Docstring: Insert object before index.
Type:      builtin_function_or_method
In [24]:
#accessing source code
sum??
Signature: sum(iterable, start=0, /)
Docstring:
Return the sum of a 'start' value (default: 0) plus an iterable of numbers

When the iterable is empty, return the start value.
This function is intended specifically for use with numeric values and may
reject non-numeric types.
Type:      builtin_function_or_method
In [42]:
#Wild card search for object
Small_list = [1,2,3,4]
Big_list = [1,2,3,4,5,6,7,8,9]
#find every object that has list
*lis*?
Big_list
Small_list
list
mylist
In [ ]:
#download dataset from url
# !curl -O https://xxxxxx/chipotle.tsv
!curl -O https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv 
# chipo = pd.read_csv('chipotle.tsv', delimiter='\t') #delimiter (\t for tab) and (, for comma)
chipo  = pd.read_csv('chipotle.tsv',delimiter='\t')

Numpy

Lets Think of all data fundamentally as arrays of numbers.

For example, images—particularly digital images—can be thought of as simply twodimensional arrays of numbers representing pixel brightness across the area.
Sound clips can be thought of as one-dimensional arrays of intensity versus time.
Text can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words.
No matter what the data are, the first step in making them analyzable will be to transform them into arrays of numbers.

NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size

In [1]:
import numpy as np
np.__version__
Out[1]:
'1.17.4'

Create Np.Array

List in python can store different type, Numpy lacks of this function give it much more efficient in terms of storing and operation

In [55]:
#Create from scratch
y = np.array([4,3,2],dtype='float32') #can add type to np.array (optional)
#Create from list
mylist = [1,2,3]
x= np.array(mylist)
print('y=',y)
print('x=',x)
y= [4. 3. 2.]
x= [1 2 3]
In [56]:
# matrix fill with one
z = np.ones((3,5)) #create 3*5 matrix of 1
print('z=',z)
#range by .arange(start,stop,step)
m = np.arange(0,20,2)
print('m=',m)
z= [[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
m= [ 0  2  4  6  8 10 12 14 16 18]
In [57]:
#create array of random int
np.random.randint(5,10,(3,2)) #random between 5 - 10 of 3*2 matrix
Out[57]:
array([[5, 6],
       [6, 6],
       [7, 9]])
Data type Description
bool_ Boolean (True or False) stored as a byte
int_ Default integer type (same as C long; normally either int64 or int32)
intc Identical to C int (normally int32 or int64)
intp Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8 Byte (–128 to 127)
int16 Integer (–32768 to 32767)
int32 Integer (–2147483648 to 2147483647)
int64 Integer (–9223372036854775808 to 9223372036854775807)
uint8 Unsigned integer (0 to 255)
uint16 Unsigned integer (0 to 65535)
uint32 Unsigned integer (0 to 4294967295)
uint64 Unsigned integer (0 to 18446744073709551615)
float_ Shorthand for float64
float16 Half-precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32 Single-precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64 Double-precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_ Shorthand for complex128
complex64 Complex number, represented by two 32-bit floats
complex128 Complex number, represented by two 64-bit floats

Np.array Attributes

In [73]:
np.random.seed(0) #set seed
s = np.random.randint(1,10,(3,3))
s #create 3*3 matrix of random array
Out[73]:
array([[6, 1, 4],
       [4, 8, 4],
       [6, 3, 5]])
In [68]:
# Each array has attributes
print("s ndim: ", s.ndim,'where dim is number of dimension and in this case is R2') #access attribute by (array.method)
print("s shape: ", s.shape,' shape is 3*3 matrix') 
print("s size: ", s.size,'size is total size shape times shape') 
s ndim:  2 where dim is number of dimension and in this case is R2
s shape:  (3, 3)  shape is 3*3 matrix
s size:  9 size is total size shape times shape

Accessing Element

In [77]:
#access element in matrix 
s[2,1] # get the third row and second column (zero-indexing)
Out[77]:
3
In [78]:
#modify last row to 1,2,3
s[2,:] = [1,2,3] ;s
Out[78]:
array([[6, 1, 4],
       [4, 8, 4],
       [1, 2, 3]])
In [93]:
# x[start:stop:step]
k = np.random.randint(1,20,20) 
print(k)
print(k[10::2]) # from mid to end and step 2 each time
print(k[::-1]) # reverse all element
[16  8  8 12 18  7  8 19 12 18 10 15 10 19  1 10 12 18 10  1]
[10 10  1 12 10]
[ 1 10 18 12 10  1 19 10 15 10 18 12 19  8  7 18 12  8  8 16]
In [94]:
#reverse matrix
s[::-1,::-1]
Out[94]:
array([[3, 2, 1],
       [4, 8, 4],
       [4, 1, 6]])

Copying of array

Normally Numpy return view So Copy need to to define explicitly

In [122]:
s_copy = s[1:,:2].copy()
s_copy
Out[122]:
array([[4, 8],
       [1, 2]])

Reshape of Arrays

In [135]:
#reshape tp 1*9 matrix
print("original shape is",s.shape)
print('new shape is ' ,s.reshape(1,9).shape)
original shape is (3, 3)
new shape is  (1, 9)
In [140]:
print(np.array([1,2,5,3]))
print('to ')
print(np.array([1,2,5,3]).reshape(4,1))
[1 2 5 3]
to 
[[1]
 [2]
 [5]
 [3]]

Concatenation of arrays

using
np.concatenate
np.vstack
np.hstack

In [154]:
x = np.array([1,2,3,4,5])
y = np.array([6,7,8,9,10])
print(np.concatenate([x,y])) # have to be same dimension
[ 1  2  3  4  5  6  7  8  9 10]
In [159]:
print(np.concatenate([s,s]))
print('concat along second axis')
print(np.concatenate([s,s],axis= 1)) # along the second axis (zero-index)
[[6 1 4]
 [4 8 4]
 [1 2 3]
 [6 1 4]
 [4 8 4]
 [1 2 3]]
concat along second axis
[[6 1 4 6 1 4]
 [4 8 4 4 8 4]
 [1 2 3 1 2 3]]
In [173]:
# Different dimension use np.vstack(same no. of columns) and np.hstack(same no. of rows)
x = np.array([1,2,3])
y = np.array([[6,7,8],
              [4,5,6]])
print(np.vstack([x,y])) #"vertical concat"
print(np.hstack([y,y])) #"horizontal concat"
[[1 2 3]
 [6 7 8]
 [4 5 6]]
[[6 7 8 6 7 8]
 [4 5 6 4 5 6]]

Splitting of arrays

np.split
np.hsplit
np.vsplit

In [228]:
# hsplit = split columns wise
np.hsplit(y,3)[0] # split y into 3 column and return the view of the first onw
Out[228]:
array([[6],
       [4]])
In [222]:
# split = row wise
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5]) # x1 = 0:3 (inclusive : exclusive) and x2 = 3:5 and x3 = 5: end
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]

NP.Operation

Because list is so flexible to store varios type in python, its also slow for computation (iteration)
Numpy.array is a lot faster if we compute by vectorized operation

In [236]:
# vectorized operation
print(y)
print('top is y and below is y+4')
print(y+4)
[[6 7 8]
 [4 5 6]]
top is y and below is y+4
[[10 11 12]
 [ 8  9 10]]
In [242]:
k = np.arange(1,11).reshape(2,5)
2**k #vectorized operation on multi-dimension
Out[242]:
array([[   2,    4,    8,   16,   32],
       [  64,  128,  256,  512, 1024]], dtype=int32)
In [246]:
#Exponents and logarithms
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
x = [1, 2, 3]
e^x = [ 2.71828183  7.3890561  20.08553692]
2^x = [2. 4. 8.]
3^x = [ 3  9 27]
In [247]:
x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
x = [1, 2, 4, 10]
ln(x) = [0.         0.69314718 1.38629436 2.30258509]
log2(x) = [0.         1.         2.         3.32192809]
log10(x) = [0.         0.30103    0.60205999 1.        ]

Specify Output

In [262]:
# can specify where to store output given output already predefined
x = np.arange(1,6) # array of 1 to 5
x_2 = np.empty(5)
np.multiply(x,3,out = x_2) # multuply 3 to every elements and store in x_2 
x_2
Out[262]:
array([ 3.,  6.,  9., 12., 15.])

Aggregates

In [274]:
# reduce() = repeatedly apply the operation until we have one result left and return it
m = np.arange(1,6)
np.add.reduce(m)
Out[274]:
15
In [271]:
# sum all have same result
np.arange(1,6).sum()
Out[271]:
15
In [276]:
# Accumulate // cumulative
print(np.add.accumulate(m))
print(np.multiply.accumulate(m))
[ 1  3  6 10 15]
[  1   2   6  24 120]

Summary Statistic

In [289]:
# Max, Mix
print('max value in m is',np.max(x),'or',x.max()) # Numpy ways much faster than Python way which is max(m) or min(m)
max value in m is 5 or 5
In [300]:
# on muti-dimensional
k = np.ones(12,dtype= 'int32').reshape(3,4) ; print(k)
print(k.sum()) # return sum of entire array
print(k.sum(axis = 0)) #sum by each column (specify by axis columns = 0 and row = 1(zero-index))
print(k.sum(axis = 1)) #sum by each rows
[[1 1 1 1]
 [1 1 1 1]
 [1 1 1 1]]
12
[3 3 3 3]
[4 4 4]

All aggregation functions

Note can use with axis parameter to operate row or column wise ex np.sum(x,axis=0) which is sum array column wise

Function Name NaN-safe Version Description
np.sum np.nansum Compute sum of elements
np.prod np.nanprod Compute product of elements
np.mean np.nanmean Compute median of elements
np.std np.nanstd Compute standard deviation
np.var np.nanvar Compute variance
np.min np.nanmin Find minimum value
np.max np.nanmax Find maximum value
np.argmin np.nanargmin Find maximum value
np.argmax np.nanargmax Find index of minimum value
np.median np.nanmedian Find index of maximum value
np.percentile np.nanper Compute rank-based statistics of elements
np.any N/A Evaluate whether any elements are true
np.all N/A Evaluate whether all elements are true
In [16]:
# Case example
import pandas as pd
data = pd.read_csv('data/president_heights.csv',index_col =0) #start with same path as this notebook store
data.head()
Out[16]:
name height(cm)
order
1 George Washington 189
2 John Adams 170
3 Thomas Jefferson 189
4 James Madison 163
5 James Monroe 183
In [22]:
# find summary statistic
heights = data['height(cm)']
print("Mean height: ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height: ", heights.min())
print("Maximum height: ", heights.max())
print("25th percentile: ", np.percentile(heights, 25))
print("Median: ", np.median(heights))
print("75th percentile: ", np.percentile(heights, 75))
Mean height:  179.73809523809524
Standard deviation: 7.015868855358296
Minimum height:  163
Maximum height:  193
25th percentile:  174.25
Median:  182.0
75th percentile:  183.0

Computation on Arrays: Broadcasting

Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of different sizes.

In [31]:
a = np.ones((3,3))
b = np.random.randint(1,20,3)
print(a,'has shape of ',a.shape)
print(b,'has shape of ',b.shape)
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]] has shape of  (3, 3)
[12 12 14] has shape of  (3,)
In [32]:
# a can combine with b by broadcasting(strech) b from 1*3 matrix to 3*3 so that they can be combine
a+b
Out[32]:
array([[13., 13., 15.],
       [13., 13., 15.],
       [13., 13., 15.]])
In [34]:
# both can be boardcast at the sametime
a =np.arange(3)
b =np.arange(3).reshape(3,1)
print(a)
print(b)
a+b
[0 1 2]
[[0]
 [1]
 [2]]
Out[34]:
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])
In [13]:
#case example using broadcasting
x =  np.random.randint(10,20,[10,3])
print(x)
x_mean = x.mean(axis = 0) ; x_mean
[[10 16 19]
 [13 11 17]
 [15 11 10]
 [18 12 15]
 [10 18 18]
 [19 15 12]
 [19 14 16]
 [13 14 15]
 [17 12 19]
 [13 17 13]]
Out[13]:
array([14.7, 14. , 15.4])
In [15]:
# we can find deviation by broadcasting the x_mean from 1*3 to 10*3 matrix
deviation = x - x_mean
deviation.mean(0)
Out[15]:
array([ 7.10542736e-16,  0.00000000e+00, -3.55271368e-16])

Comparison and Boolean Mask

Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.

Comparison Operator as ufuncs

In [37]:
x = np.arange(1,11) #array of 1 -10
print(x)
print(x<3)
print(x>=3)
print(x != 3)
[ 1  2  3  4  5  6  7  8  9 10]
[ True  True False False False False False False False False]
[False False  True  True  True  True  True  True  True  True]
[ True  True False  True  True  True  True  True  True  True]
In [40]:
# element-by-element comparison of two arrays
(2*x) == (x**2)
Out[40]:
array([False,  True, False, False, False, False, False, False, False,
       False])

Boolean array

Counting entries

In [68]:
# Count all entries (different way of counting)
y = np.arange(1,7).reshape(3,2)
print(y)
print(y.size)
print(np.count_nonzero(y))
# count conditional where y<6
print(np.count_nonzero(y<3))
print(np.sum(y<3)) #in this case, False is interpreted as 0, and True is interpreted as 1
print(np.sum(y<3,axis= 0)) # aggregation functions can define axis for row or column wise operation
[[1 2]
 [3 4]
 [5 6]]
6
6
2
2
[1 1]
In [73]:
# check any and all values
print(np.any(y>6)) # are there any value in y that is more than 6
np.all(y<10) # Are all value in y less than 10
False
Out[73]:
True

Boolean operators

Operator Equivalent ufunc Description
& np.bitwise_and and
| np.bitwise_or or
^ np.bitwise_xor bit wise exclusive or
~ np.bitwise_not not

*note xor rules will examine in bit mode is 0^0 =0 and 1^1=0 and 0^1 = 1 and 1^0 =0

In [26]:
# Load case example from Seattle file // do summary statistic
data = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
inches = data/254
inches.shape
Out[26]:
(365,)
In [76]:
# ((condition1) & (condition2))  == (()and()) 
np.sum((inches<1) & (inches>0.5))
Out[76]:
29
In [79]:
print("Number days without rain: ", np.sum(inches==0) )
print("Number days with rain: ", np.sum(inches>0))
print("Days with more than 0.5 inches:", np.sum(inches>0.5) )
print("Rainy days with < 0.1 inches :",np.sum((inches>0) & (inches<0.1)) )
Number days without rain:  215
Number days with rain:  150
Days with more than 0.5 inches: 37
Rainy days with < 0.1 inches : 46

Boolean Arrays as Masks

to select particular subsets of the data themselves.

In [84]:
x = np.random.randint(1,10,12).reshape(3,4)
print(x)
[[9 4 4 8]
 [8 2 5 2]
 [4 6 1 7]]
In [90]:
#obtain subset of x where values is less that 5
#step 1 boolean mask
print(x<5)
#step 2 return every value that index is True (which mean less than 5)
x[x<5] # returned a one-dimensional array with all values that meet  condition; in other words, all the values in positions at which the mask array is True.
[[False  True  True False]
 [False  True False  True]
 [ True False  True False]]
Out[90]:
array([4, 4, 2, 2, 4, 1])
In [123]:
rainy_data = inches[inches>0]  # return all value that is more than 0
# construct a mask of all summer days (June 21st is the 172nd day)
summer_day = (np.arange(365) - 172 <90)  & (np.arange(365) - 172 >0) # list range from 1 to 365 and -172 then summer day will be between 1 and 90
print("Median precip on rainy days in 2014 (inches): ", np.median(inches[inches>0]))
print("Median precip on summer days in 2014 (inches): ",np.median(inches[summer_day]))
print("Maximum precip on summer days in 2014 (inches): ",np.max(inches[summer_day]))
print("Median precip on non-summer and rainy days (inches):",np.median(inches[(inches>0) & ~summer_day]))
Median precip on rainy days in 2014 (inches):  0.19488188976377951
Median precip on summer days in 2014 (inches):  0.0
Maximum precip on summer days in 2014 (inches):  0.8503937007874016
Median precip on non-summer and rainy days (inches): 0.20078740157480315

Fancy Indexing P.78

In [130]:
r = np.random.RandomState(42) # set seed to this variable
x = r.randint(100, size = 10)
x
Out[130]:
array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])
In [140]:
# normally we access value by index
print(x[[0,1,2,3]] )# return value in index 0,1,2,3
#same as
ind = [0,1,2,3]
x[ind]
[51 92 14 71]
Out[140]:
array([51, 92, 14, 71])
In [139]:
#with fancy index result copy shape of the index
ind2 = np.array([[0,1],
                 [2,3]])
x[ind2]
Out[139]:
array([[51, 92],
       [14, 71]])
In [166]:
#another example
y = np.arange(12).reshape(3,4)
print(y)
y[[0,1,2],[2,1,3]] # mean return array of first (row0,col2) ,(row1,col1) ,(row2,col3) by matching row and column
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Out[166]:
array([ 2,  5, 11])

Combinded Index

In [167]:
# Slicing with fancy index
y[1:,[3,1,2]] #return everything from row 1 that is column 3 ,1 and 2 respectively
Out[167]:
array([[ 7,  5,  6],
       [11,  9, 10]])

Modify array with Fancy index

In [180]:
x = np.arange(10)
i = np.array([2,1,8,4])
x[i] = 99
print(x)
x[i] -= 10
print(x)
[ 0 99 99  3 99  5  6  7 99  9]
[ 0 89 89  3 89  5  6  7 89  9]

Sorting Arrays

np.sort ~ O N log N which is much more efficient that .sort() in normal python
np.argsort ~ returns the indices of the sorted elements

note for pd.DataFrame use $.sort\_values()$

In [181]:
np.sort(x) # not inplace
x.sort() #sort inplace
Out[181]:
array([ 0,  3,  5,  6,  7,  9, 89, 89, 89, 89])
In [191]:
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x) # return the index of sorting array
x[i] # which can then be use to sort with fany index
Out[191]:
array([1, 2, 3, 4, 5])

sorting along row or column

In [62]:
x = np.random.randint(1,100,20).reshape(5,4)
x
Out[62]:
array([[90, 58, 82, 73],
       [70, 34, 45, 78],
       [25, 51, 31, 19],
       [44, 36, 10, 54],
       [16,  1, 29, 53]])
In [67]:
np.sort(x,axis = 0) # axis =0 is sort by column
Out[67]:
array([[16,  1, 10, 19],
       [25, 34, 29, 53],
       [44, 36, 31, 54],
       [70, 51, 45, 73],
       [90, 58, 82, 78]])

Partial Sorts: Partitioning

$np.partition()$ function.
np.partition takes an array and a number K; the result is a new array with the smallest K values to the left of the partition, and the remaining values to the right, in arbitrary order
$np.argpartition$ ~ return index

In [203]:
x = np.array([4,521,21,9,95,77,2])
# basically return 'k' smallest number to the leftmost and other number in random(arbitary) order
np.partition(x,2) # return two smallest which is 2 and 4 in the left most other are in random order
Out[203]:
array([  2,   4,   9,  21,  95,  77, 521])

Structure array

In [224]:
data = np.zeros(4)
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                         'formats':('U10', 'i4', 'f8')})
data
Out[224]:
array([('', 0, 0.), ('', 0, 0.), ('', 0, 0.), ('', 0, 0.)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
In [219]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]
In [222]:
# Get all names
print(data['name'])
# Get names where age is under 30
print(data[data['age'] < 30]['name'])
['Alice' 'Bob' 'Cathy' 'Doug']
['Alice' 'Doug']

Pandas

focus on the mechanics of using Series, DataFrame, and related structures effectively.

In [2]:
import pandas
pandas.__version__
Out[2]:
'0.25.3'

Pandas objects

three fundamental Pandas data structures: the Series, DataFrame, and Index.

Panda Series object

A Pandas Series is a one-dimensional array of indexed data.
The essential difference with np.array is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values. (index can be define manually as anything (float,string.etc))

$pd.Series(data, index=index)$

where index is an optional argument, and data can be one of many entities.

In [227]:
p = pd.Series([100,200,300,400])
print(p)
p.values # .values are simply np.array
0    100
1    200
2    300
3    400
dtype: int64
Out[227]:
array([100, 200, 300, 400], dtype=int64)
In [231]:
# index can be anything ,also string
p = pd.Series([100,200,300.5,400], 
             index = ['a','b','c','55'])
p
Out[231]:
a     100.0
b     200.0
c     300.5
55    400.0
dtype: float64
In [101]:
# construct from python dictionary
p_dict = {'California': 38332525,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(p_dict)
population
Out[101]:
California    38332525
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
In [238]:
# data can repeatedly fill it self
pd.Series('Copy me', index=[1, 2, 3])
Out[238]:
1    Copy me
2    Copy me
3    Copy me
dtype: object

Pandas DataFrame Objects

DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [102]:
#create DataFrame by combine pd.Series and dictionay in dictionary format
area = {'California': 55,
'Texas': 56,
'New York': 54,
'Florida': 89,
'Illinois': 123}
dat = pd.DataFrame({'area' : area,
                    'pop'  : population}) # add column
dat
Out[102]:
area pop
California 55 38332525
Florida 89 19552860
Illinois 123 12882135
New York 54 19651127
Texas 56 26448193
In [244]:
print(dat.index) #access index(row)
print(dat.columns) #access attribute(columns)
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
Index(['area', 'pop'], dtype='object')
In [245]:
dat['area'] #access attribute(columns)
Out[245]:
California     55
Florida        89
Illinois      123
New York       54
Texas          56
Name: area, dtype: int64
In [256]:
#create df specify columns and index
D = pd.DataFrame(['Apple','Orange',"Pineapple"], index = [1,2,3], columns = ['Fruit'])
D
Out[256]:
Fruit
1 Apple
2 Orange
3 Pineapple
In [257]:
# create specify index from series
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df
Out[257]:
Name Item Purchased Cost
Store 1 Chris Dog Food 22.5
Store 1 Kevyn Kitty Litter 2.5
Store 2 Vinod Bird Seed 5.0

Pandas Index Objects

can be thought of either as an immutable array or as an ordered set

Data Indexing and Selection

Data selection in pd.Series

In [269]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                  index=['a', 'b', 'c', 'd'])
print(data)
data['b'] # access value with df['index']
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
Out[269]:
0.5
In [275]:
# examine index in Series / DataFrame
print('c' in data)
print(data.keys())
print(list(data.items()))
True
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
In [276]:
#add value "row wise"
data['e'] = 1.25
data
Out[276]:
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
In [282]:
# Slicing
print(data['b':'d']) # slice from index
print(data[1:4])     # slicing by implicit integer index (zero- index and (inclusive, exclusive)) 
b    0.50
c    0.75
d    1.00
dtype: float64
b    0.50
c    0.75
d    1.00
dtype: float64
In [283]:
# masking
data[(data > 0.3) & (data < 0.8)]
Out[283]:
b    0.50
c    0.75
dtype: float64

Indexers: loc, iloc, and ix

In [294]:
# iloc is implicit index (zero index)
data.iloc[0]
Out[294]:
0.25
In [301]:
# loc is explicit index (non-zero index)
data.loc['a']
Out[301]:
0.25

Data Selection in DataFrame

In [103]:
print(dat['area']) # access via dictionary style
dat
California     55
Florida        89
Illinois      123
New York       54
Texas          56
Name: area, dtype: int64
Out[103]:
area pop
California 55 38332525
Florida 89 19552860
Illinois 123 12882135
New York 54 19651127
Texas 56 26448193
In [313]:
dat.area # access via attribute style (but not prefer)
Out[313]:
California     55
Florida        89
Illinois      123
New York       54
Texas          56
Name: area, dtype: int64
In [312]:
dat.area is dat['area']
Out[312]:
True
In [104]:
dat['density'] = dat['area'] / dat['pop']
dat
Out[104]:
area pop density
California 55 38332525 0.000001
Florida 89 19552860 0.000005
Illinois 123 12882135 0.000010
New York 54 19651127 0.000003
Texas 56 26448193 0.000002

DataFrame as two-dimensional array

In [319]:
dat.T # Transpose
Out[319]:
California Florida Illinois New York Texas
area 5.500000e+01 8.900000e+01 1.230000e+02 5.400000e+01 5.600000e+01
pop 3.833252e+07 1.955286e+07 1.288214e+07 1.965113e+07 2.644819e+07
density 1.434813e-06 4.551764e-06 9.548107e-06 2.747934e-06 2.117347e-06
In [322]:
# index like in np.array with df.values to extract array
dat.values[0] # get first row 
Out[322]:
array([5.50000000e+01, 3.83325250e+07, 1.43481286e-06])

Indexers: loc, iloc, and ix

In [326]:
dat.iloc[:3,1:]
Out[326]:
pop density
California 38332525 0.000001
Florida 19552860 0.000005
Illinois 12882135 0.000010
In [105]:
dat.loc[:'Illinois','pop':]
Out[105]:
pop density
California 38332525 0.000001
Florida 19552860 0.000005
Illinois 12882135 0.000010
In [329]:
#The ix indexer allows a hybrid of these two approaches:
dat.ix[:3,'pop']
C:\Users\PK\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  
C:\Users\PK\Anaconda3\lib\site-packages\pandas\core\indexing.py:961: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  return getattr(section, self.name)[new_key]
Out[329]:
California    38332525
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64
In [106]:
dat.loc[dat['area']>80,['area']] # using with mask (return only area columns)
Out[106]:
area
Florida 89
Illinois 123
In [335]:
# modify value to 5
dat.density['Florida'] = 5
# is same as
dat.loc['Florida','density']= 5
# is same as
dat.iloc[1,2] = 5
C:\Users\PK\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
In [336]:
dat
Out[336]:
area pop density
California 55 38332525 0.000001
Florida 89 19552860 5.000000
Illinois 123 12882135 0.000010
New York 54 19651127 0.000003
Texas 56 26448193 0.000002
In [358]:
# using fancy index to specify the column and row you want
dat.loc[['Illinois','California'],['pop','density']]           
Out[358]:
pop density
Illinois 12882135 0.000010
California 38332525 0.000001

Additional index conventions

indexing refers to columns,
slicing refers to rows

In [30]:
#combind masking with index
dat[dat['pop']>15882135].loc['Florida':'New York','pop':'density']
Out[30]:
pop density
Florida 19552860 0.000005
New York 19651127 0.000003

Operating on Data in Pandas

Index alignment in Series

In [3]:
area = pd.Series({'Alaska' : 170,
                  'Texas' : 180,
                  'California': 423967}, 
                 name='area')
population = pd.Series({'California': 38332521, 
                        'Texas': 26448193,
                        'New York': 19651127}, 
                       name='population')
area, population
Out[3]:
(Alaska           170
 Texas            180
 California    423967
 Name: area, dtype: int64, California    38332521
 Texas         26448193
 New York      19651127
 Name: population, dtype: int64)
In [7]:
population + area #The resulting array contains the union of indices of the two input arrays
Out[7]:
Alaska               NaN
California    38756488.0
New York             NaN
Texas         26448373.0
dtype: float64
In [5]:
area.index | population.index
Out[5]:
Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
In [6]:
# add condition for empty) value = 0 then add
area.add(population,fill_value=0)
Out[6]:
Alaska             170.0
California    38756488.0
New York      19651127.0
Texas         26448373.0
dtype: float64

Mapping between Python operators and Pandas methods
Note ~ can use condition like fill_value for empty value

Python operator Pandas method(s)
+ add()
- sub()
* mul(), multiply()
/ truediv(),div()
// floordiv()
% mod()
** pow()
In [15]:
# operate row - column wise in DataFrame
D = pd.DataFrame(np.arange(1,13).reshape(3,4), columns=['A', 'B','C','D'])
D
Out[15]:
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
In [18]:
# row wise (minus everything with first row, row wise)
D - D.iloc[0]
Out[18]:
A B C D
0 0 0 0 0
1 4 4 4 4
2 8 8 8 8
In [24]:
#column wise (minus everything with 'B' column, column wise) # have to use method
D.sub(D['B'],axis=0)
Out[24]:
A B C D
0 -1 0 1 2
1 -1 0 1 2
2 -1 0 1 2

Handling Missing Data

we’ll refer to missing data in general as null, NaN, or NA values.

Strategies

  • Masking apoarch
    Indicate null value with Boolean array
  • Sentinel approach
    Indicate a missing integer with -9999 or NaN

Sentinel value

  • Nan (np.nan)
In [16]:
# np.nan has float64 type
data_with_nan  = np.array([1,2,3,np.nan,4])
data_with_nan.dtype
Out[16]:
dtype('float64')
In [18]:
# anything + - * / with nan = nan
data_with_nan + 5
Out[18]:
array([ 6.,  7.,  8., nan,  9.])
In [25]:
# need to use special aggregate function like np.nanmax, np.nansum, etc
print(data_with_nan.max())
np.nanmax(data_with_nan)
nan
Out[25]:
4.0

Sentinel value

  • Null values

    tools
    $isnull()$ ~ Generate a Boolean mask indicating missing values
    $notnull()$ ~ Opposite of isnull()
    $dropna()$ ~ Return a filtered version of the data
    $fillna()$ ~ Return a copy of the data with missing values filled or imputed

Detecting null values

In [34]:
data = pd.Series([1,None,2,np.nan])
print(pd.isnull(data))
# same as
print(data.isnull()) # True for null (good for filter for null values)
0    False
1     True
2    False
3     True
dtype: bool
0    False
1     True
2    False
3     True
dtype: bool
In [35]:
# opposite to 
data.notnull() # False for null (good for filter for 'not' null values)
Out[35]:
0     True
1    False
2     True
3    False
dtype: bool
In [38]:
data[data.notnull()] # filter for 'not' null values
Out[38]:
0    1.0
2    2.0
dtype: float64

Dropping null values

In [44]:
print(data.dropna() == data[data.notnull()])
print(data.dropna() )
0    True
2    True
dtype: bool
0    1.0
2    2.0
dtype: float64
In [96]:
# for data frame , can use fillna()
df = pd.DataFrame([[1, np.nan, 2],
                   [2,    3,   5],
                   [np.nan, 4, 6]])
df
Out[96]:
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
In [52]:
# We cannot drop single values from a DataFrame; we can only drop full rows or full columns
df.dropna() # drop if any na in row
Out[52]:
0 1 2
1 2.0 3.0 5
In [61]:
df.dropna(axis= 'rows')
Out[61]:
0 1 2
1 2.0 3.0 5
In [71]:
df.dropna(axis= 'columns') #diff from numpy axis, pd axis = 1 us column and 0 is row so better say explicitly columns or rows
Out[71]:
2
0 2
1 5
2 6
In [79]:
# control dropping rows or columns with all NA values, or a majority of NA values.
# parameter  'thresh' or 'how'
df[3] = np.nan 
print(df)
df.dropna(how='all', axis='columns')
     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN
Out[79]:
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
In [81]:
# thresh parameter lets you specify a minimum number of (non-null values) to kept
df.dropna(thresh=3) # kept only if row have 3 or more of non-null
Out[81]:
0 1 2 3
1 2.0 3.0 5 NaN

Remove NAsand shift (push) the values up

In [2]:
# sample df
pushup_df = pd.DataFrame({'Name':['Samantha','Julia','Meera','Ashely','Ketty'],
             'Occupation':['Doctor','Actor','Doctor','Singer','Actor']})
pushup_df
Out[2]:
Name Occupation
0 Samantha Doctor
1 Julia Actor
2 Meera Doctor
3 Ashely Singer
4 Ketty Actor
In [9]:
# categorize by Occupation
pivot_1 = pushup_df.pivot(columns='Occupation') 
pivot_1# many NaN
Out[9]:
Name
Occupation Actor Doctor Singer
0 NaN Samantha NaN
1 Julia NaN NaN
2 NaN Meera NaN
3 NaN NaN Ashely
4 Ketty NaN NaN
In [10]:
# remove only NaNs and shift other value up
pivot_1.apply(lambda x : pd.Series(x.dropna().values)).reset
Out[10]:
Name
Occupation Actor Doctor Singer
0 Julia Samantha Ashely
1 Ketty Meera NaN

Filling null values

In [105]:
# fillna() method  returns a 'copy' of the array with the null values replaced.
df.fillna(0)
Out[105]:
0 1 2 3
0 1.0 0.0 2 0.0
1 2.0 3.0 5 0.0
2 0.0 4.0 6 0.0
In [93]:
# can fill using mask with .isna() will be 'inplace'
df[df.isna()] = 0
df
Out[93]:
0 1 2 3
0 1.0 0.0 2 0.0
1 2.0 3.0 5 0.0
2 0.0 4.0 6 0.0
In [114]:
df
Out[114]:
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In [99]:
# forward fill (copy value from previous row or column)
df.fillna(method='ffill') #rows (copy from next row)
Out[99]:
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 2.0 4.0 6 NaN
In [113]:
# backward fill (copy value from following row or column)
df.fillna(method='bfill',axis='columns')  #column  (copy from following column)
Out[113]:
0 1 2 3
0 1.0 2.0 2.0 NaN
1 2.0 3.0 5.0 NaN
2 4.0 4.0 6.0 NaN

Handle Outliers

One easy way to remove these all at once is to cut outliers; we’ll do this via a robust sigma-clipping operation:1

In [6]:
#Sample data
!curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
births = pd.read_csv('births.csv')
print(births.shape)
births.head(3)
(15547, 5)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 61  258k   61  159k    0     0   177k      0  0:00:01 --:--:--  0:00:01  177k
100  258k  100  258k    0     0   223k      0  0:00:01  0:00:01 --:--:--  223k
Out[6]:
year month day gender births
0 1969 1 1.0 F 4046
1 1969 1 1.0 M 4440
2 1969 1 2.0 F 4454
In [7]:
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0]) # a robust estimate of the sample mean, where 0.74 comes from the interquartile range of a Gaussian distribution
# then use query to filter out rows outside these values
births_1 = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
births_1.shape # row were reduce from 15547 to 14610
Out[7]:
(14610, 5)

Another way is normalization to Z-score and filter in value with less than 3 and more then -3, Z score

In [11]:
from scipy import stats
births['zscore'] = np.abs(stats.zscore(births.births)) # assign new column of absolute Z score

births_2 = births[(births.zscore < 3) ]
births.drop('zscore', axis='columns', inplace=True)
births_2.drop('zscore', axis='columns', inplace=True)
births_2.shape # drop from 15547 to 15067
C:\Users\PK\Anaconda3\lib\site-packages\pandas\core\frame.py:4117: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
Out[11]:
(15067, 5)

Hierarchical Indexing

aka multi indexing

In [45]:
index = [('California', 2000), ('California', 2010),
            ('New York', 2000), ('New York', 2010),
            ('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
                18976457, 19378102,
                20851820, 25145561]

pop = pd.Series(populations, index=index)
pop
Out[45]:
(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64
In [46]:
# use panda multiIndex to get all 2010
# create multiIndex
index = pd.MultiIndex.from_tuples(index)
index
Out[46]:
MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )
In [47]:
# reindex our series with this MultiIndex, we see the hierarchical representation of the data:
pop = pop.reindex(index)
pop
Out[47]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In [132]:
#access data
pop[:,2010]
Out[132]:
California    37253956
New York      19378102
Texas         25145561
dtype: int64
In [134]:
# convert a multiplyindexed Series into a conventionally indexed DataFrame with unstack()
pop_df = pop.unstack()
pop_df
Out[134]:
2000 2010
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561
In [135]:
# convert back to multi index
pop_df.stack()
Out[135]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In [136]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                    4687374, 4318033,
                                    5906301, 6879014]})
pop_df
Out[136]:
total under18
California 2000 33871648 9267089
2010 37253956 9284094
New York 2000 18976457 4687374
2010 19378102 4318033
Texas 2000 20851820 5906301
2010 25145561 6879014
In [144]:
# fraction of people under 18 by year, given the above data
(pop_df['under18'] / pop_df['total']).unstack()
Out[144]:
2000 2010
California 0.273594 0.249211
New York 0.247010 0.222831
Texas 0.283251 0.273568

Methods of MultiIndex Creation

In [158]:
# construct
df = pd.DataFrame(np.random.randint(1,100,(4,2)),
                  index = [['a','a','b','b'], [1,2,1,2]] , # multi Index (index = [[first index], [ second index]])
                  columns= ['dat1', 'dat2'])
df
Out[158]:
dat1 dat2
a 1 24 64
2 36 17
b 1 93 81
2 72 3

MultiIndex level names

In [159]:
pop
Out[159]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In [55]:
#assign name
pop.index.names = ['state', 'year']
pop
Out[55]:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

MultiIndex for columns

In [44]:
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                    names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                    names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1)+20
health_data = pd.DataFrame(data, index=index, columns=columns)
print(data)
health_data
[[19.  18.9 20.4 20.1 19.7 18.6]
 [18.9 20.  19.8 18.7 19.7 19.5]
 [20.2 19.6 20.5 18.8 20.6 19.5]
 [19.7 20.8 20.5 19.3 19.9 19. ]]
Out[44]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 19.0 18.9 20.4 20.1 19.7 18.6
2 18.9 20.0 19.8 18.7 19.7 19.5
2014 1 20.2 19.6 20.5 18.8 20.6 19.5
2 19.7 20.8 20.5 19.3 19.9 19.0
In [34]:
# access data with data[column or names][row or index]
health_data['Guido','HR'][2013,1] 
Out[34]:
19.9

Indexing and slicing in MultiIndex

In [11]:
pop
Out[11]:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In [12]:
#indexing 
print(pop['New York',2010])
19378102
In [15]:
#Partial slicing on lower levels
pop.loc[:,2000]
Out[15]:
state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
In [16]:
#masking
pop[pop>20851820]
Out[16]:
state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64
In [25]:
#fancy index
print(pop[['California', 'Texas']])
print(pop[['California', 'Texas']][:,2010]) # filter on lower lever on chained []
state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64
state
California    37253956
Texas         25145561
dtype: int64
In [28]:
health_data
Out[28]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 20.1 20.3 19.9 20.2 18.3 18.9
2 19.4 20.5 22.2 22.0 20.2 20.0
2014 1 21.3 19.6 21.5 20.3 19.0 20.1
2 19.9 18.7 19.7 20.1 19.1 19.5

Multiply indexed DataFrames

columns are primary in a DataFrame

In [40]:
# access data with data[column or names][row or index]
health_data['Sue']
Out[40]:
type HR Temp
year visit
2013 1 18.3 18.9
2 20.2 20.0
2014 1 19.0 20.1
2 19.1 19.5
In [41]:
health_data.iloc[:2, :2]
Out[41]:
subject Bob
type HR Temp
year visit
2013 1 20.1 20.3
2 19.4 20.5
In [51]:
health_data.loc[2013,['Bob','Sue']]
Out[51]:
subject Bob Sue
type HR Temp HR Temp
visit
1 20.1 20.3 18.3 18.9
2 19.4 20.5 20.2 20.0
In [52]:
# use an IndexSlice object,which Pandas provides for precisely this situation
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
Out[52]:
subject Bob Guido Sue
type HR HR HR
year visit
2013 1 20.1 19.9 18.3
2014 1 21.3 21.5 19.0

Rearranging Multi-Indices

Sorted and unsorted indices

the MultiIndex slicing operations will fail if the index is not sorted

In [14]:
data = pd.Series(np.random.rand(6), index=pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]]))
data.index.names = ['char','int']
data
Out[14]:
char  int
a     1      0.993751
      2      0.902126
c     1      0.137149
      2      0.683556
b     1      0.605402
      2      0.322246
dtype: float64
In [28]:
# step of access data for Series(only 1 column)
data[['a','b']]['a'][1] == data['a'][1]
Out[28]:
True
In [29]:
# slicing reqire multiIndex to be sorted : sort_index()
data = data.sort_index()
data # a,c,b -> a,b,c sort by alphabet
Out[29]:
char  int
a     1      0.993751
      2      0.902126
b     1      0.605402
      2      0.322246
c     1      0.137149
      2      0.683556
dtype: float64
In [39]:
# now can slice
data[:'b']
Out[39]:
char  int
a     1      0.993751
      2      0.902126
b     1      0.605402
      2      0.322246
dtype: float64
In [41]:
pd.Series(np.random.rand(6), index=pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],sortorder=2))
Out[41]:
a  1    0.845384
   2    0.775311
c  1    0.130972
   2    0.374819
b  1    0.287058
   2    0.022339
dtype: float64

Stacking and unstacking indices

In [51]:
health_data
Out[51]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 19.0 18.9 20.4 20.1 19.7 18.6
2 18.9 20.0 19.8 18.7 19.7 19.5
2014 1 20.2 19.6 20.5 18.8 20.6 19.5
2 19.7 20.8 20.5 19.3 19.9 19.0
In [52]:
health_data.stack()
Out[52]:
subject Bob Guido Sue
year visit type
2013 1 HR 19.0 20.4 19.7
Temp 18.9 20.1 18.6
2 HR 18.9 19.8 19.7
Temp 20.0 18.7 19.5
2014 1 HR 20.2 20.5 20.6
Temp 19.6 18.8 19.5
2 HR 19.7 20.5 19.9
Temp 20.8 19.3 19.0
In [56]:
pop
Out[56]:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In [64]:
pop.unstack(level=0)
Out[64]:
state California New York Texas
year
2000 33871648 18976457 20851820
2010 37253956 19378102 25145561
In [62]:
pop.unstack(level=1)
Out[62]:
year 2000 2010
state
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561

Index setting and resetting

turn the index labels into columns; with the reset_index method. (reset all multiIndex)

In [68]:
flat = pop.reset_index()
flat # look like real world data
Out[68]:
state year 0
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561
In [81]:
# convert flat in to multiple index
flat.set_index(['state','year'])
Out[81]:
0
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561

Data Aggregations on Multi-Indices

for Multi-index. aggregation methods, such as mean(), sum(), and max(). can be passed a level parameter that controls which subset of the data the aggregate is computed on.

In [82]:
health_data
Out[82]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 19.0 18.9 20.4 20.1 19.7 18.6
2 18.9 20.0 19.8 18.7 19.7 19.5
2014 1 20.2 19.6 20.5 18.8 20.6 19.5
2 19.7 20.8 20.5 19.3 19.9 19.0
In [99]:
mean = health_data.mean(level='visit')
mean
Out[99]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
visit
1 19.6 19.25 20.45 19.45 20.15 19.05
2 19.3 20.40 20.15 19.00 19.80 19.25
In [103]:
mean.mean(level='type',axis= 'columns')
Out[103]:
type HR Temp
visit
1 20.066667 19.25
2 19.750000 19.55
In [97]:
health_data.median(level='type',axis='columns')
Out[97]:
type HR Temp
year visit
2013 1 19.7 18.9
2 19.7 19.5
2014 1 20.5 19.5
2 19.9 19.3

Combining Datasets: Concat and Append

In [113]:
# For convenience, we’ll define this function, which creates a DataFrame of a particular form that will be useful below:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    print(data)
    return pd.DataFrame(data, ind)
In [114]:
make_df('ABC', range(3))
{'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2'], 'C': ['C0', 'C1', 'C2']}
Out[114]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
In [121]:
# concat with pd.concat()
# By default, the concatenation takes place row-wise within the DataFrame (i.e.,axis=0).
A = make_df('ABC', range(3))
B = make_df('DEF', range(4))
pd.concat([A,B], axis=1).fillna(0) #concat column wise with axis = 1
{'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2'], 'C': ['C0', 'C1', 'C2']}
{'D': ['D0', 'D1', 'D2', 'D3'], 'E': ['E0', 'E1', 'E2', 'E3'], 'F': ['F0', 'F1', 'F2', 'F3']}
Out[121]:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 0 0 0 D3 E3 F3
In [123]:
# duplicate indices
A = make_df('ABC', range(2))
B = make_df('ABC', range(3))
pd.concat([A,B]) # concat will preserve value and replicate the duplicate row index
{'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'C': ['C0', 'C1']}
{'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2'], 'C': ['C0', 'C1', 'C2']}
Out[123]:
A B C
0 A0 B0 C0
1 A1 B1 C1
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
In [128]:
# verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag
try :
    pd.concat([A,B], verify_integrity=True)
except ValueError as e :
    print("error: ",e)
error:  Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
In [133]:
# ignored index
pd.concat([A,B], ignore_index=True) # it will reindex to 0,1,2,3,4,...
Out[133]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A0 B0 C0
3 A1 B1 C1
4 A2 B2 C2
In [134]:
# or concat with multiIndex
pd.concat([A,B], keys=['X','Y'])
Out[134]:
A B C
X 0 A0 B0 C0
1 A1 B1 C1
Y 0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2

Concatenation with joins

In [138]:
# set up sample data frame
staff = pd.DataFrame([
    {'Name':'Kelly','Role':'admin'},
    {'Name':'John','Role':'data engineer'}
])
staff = staff.set_index('Name')

student = pd.DataFrame([
    {'Name':'John','School':'Engineer'},
    {'Name':'PK','School':'Business'}
])
student = student.set_index('Name')
print(staff)
print(student)
                Role
Name                
Kelly          admin
John   data engineer
        School
Name          
John  Engineer
PK    Business
In [141]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); print(pd.concat([df5, df6]))
{'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}
{'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3  NaN  B3  C3   D3
4  NaN  B4  C4   D4
C:\Users\PK\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  This is separate from the ipykernel package so we can avoid doing imports until
In [146]:
# concat with 'inner' join
pd.concat([df5,df6],join='inner')
Out[146]:
B C
1 B1 C1
2 B2 C2
3 B3 C3
4 B4 C4
In [150]:
# concat with specify column
pd.concat([df5,df6], join_axes=[df6.columns])
C:\Users\PK\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
  
Out[150]:
B C D
1 B1 C1 NaN
2 B2 C2 NaN
3 B3 C3 D3
4 B4 C4 D4

Append method

In [156]:
pd.concat([df5,df6]) == df5.append(df6) # samething
C:\Users\PK\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  """Entry point for launching an IPython kernel.
Out[156]:
A B C D
1 True True True False
2 True True True False
3 False True True True
4 False True True True

Combining Datasets: Merge and Join

  • pd.merge
  • join()

One to one join

In [180]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                       'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                   'hire_date': [2004, 2008, 2012, 2014]})
print(df1); print(df2)
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014
In [167]:
df3 = pd.merge(df1,df2)
df3
Out[167]:
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

Many to one

In [168]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
               'supervisor': ['Carly', 'Guido', 'Steve']})
print(df3); print(df4); print(pd.merge(df3, df4))
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve
  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve

Many-to-many joins

In [169]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting','Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux','spreadsheets', 'organization']})
df5
Out[169]:
group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization
In [171]:
# many value got duplicate so shold consider use option in pd.merge()
pd.merge(pd.merge(df3, df4), df5)
Out[171]:
employee group hire_date supervisor skills
0 Bob Accounting 2008 Carly math
1 Bob Accounting 2008 Carly spreadsheets
2 Jake Engineering 2012 Guido coding
3 Jake Engineering 2012 Guido linux
4 Lisa Engineering 2004 Guido coding
5 Lisa Engineering 2004 Guido linux
6 Sue HR 2014 Steve spreadsheets
7 Sue HR 2014 Steve organization

Specification of the Merge Key

On keyword

work only same column name

In [174]:
print(df1) ; print(df2)
pd.merge(df1,df2,on='employee' ) 
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014
Out[174]:
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

left_on , right_on keyword

specify each column
left_on ~ specify with column to join from left table
right_on ~ specify with column to join from right table

In [183]:
# can join column with diff name 
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                  'salary': [70000, 80000, 120000, 90000]})
print(df1); print(df3)
df1_df3 = pd.merge(df1,df3, left_on='employee', right_on='name')
df1_df3
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
   name  salary
0   Bob   70000
1  Jake   80000
2  Lisa  120000
3   Sue   90000
Out[183]:
employee group name salary
0 Bob Accounting Bob 70000
1 Jake Engineering Jake 80000
2 Lisa Engineering Lisa 120000
3 Sue HR Sue 90000
In [187]:
# the result has redundant column that we can drop
df1_df3.drop('name',axis='columns') # if column has to specify axis
Out[187]:
employee group salary
0 Bob Accounting 70000
1 Jake Engineering 80000
2 Lisa Engineering 120000
3 Sue HR 90000

The left_index and right_index keywords

In [190]:
# merge on index
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
print(df1a); print(df2a)
                group
employee             
Bob        Accounting
Jake      Engineering
Lisa      Engineering
Sue                HR
          hire_date
employee           
Lisa           2004
Bob            2008
Jake           2012
Sue            2014
In [191]:
pd.merge(df1a,df2a, left_index=True, right_index=True)
Out[191]:
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
In [192]:
# same thing as join() which is merge on index
pd.merge(df1a,df2a, left_index=True, right_index=True) == df1a.join(df2a)
Out[192]:
group hire_date
employee
Bob True True
Jake True True
Lisa True True
Sue True True

Specifying Set Arithmetic for Joins

In [6]:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                    columns=['name', 'food'])

df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                   'drink': ['wine', 'beer']},
                    columns=['name', 'drink'])

print(df6); print(df7)
    name   food
0  Peter   fish
1   Paul  beans
2   Mary  bread
     name drink
0    Mary  wine
1  Joseph  beer
In [5]:
# merge default with inner join
pd.merge(df6, df7)
Out[5]:
name food drink
0 Mary bread wine
In [7]:
# get everything (Union)
pd.merge(df6,df7, how='outer')
Out[7]:
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
3 Joseph NaN beer
In [8]:
# also 'left' and 'right' for left join and right join
pd.merge(df6,df7, how='left')
Out[8]:
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine

Overlapping Column Names: The suffixes Keyword

Deal with acase where your two input DataFrames have conflicting column names.

In [15]:
df8 = pd.DataFrame( {'name' : ['Bob','Jake', 'Lisa', 'Sue'],
                     'rank' : [1, 2, 3, 4]})

df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})

print(df8); print(df9)
   name  rank
0   Bob     1
1  Jake     2
2  Lisa     3
3   Sue     4
   name  rank
0   Bob     3
1  Jake     1
2  Lisa     4
3   Sue     2
In [16]:
# conflict value will create two separate columns _x,_y
pd.merge(df8, df9, on="name")
Out[16]:
name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2
In [18]:
# specify the suffixes instead of _x, _y
pd.merge(df8,df9, on='name',suffixes=(' from left table',' from right table'))
Out[18]:
name rank from left table rank from right table
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2
In [24]:
#Example case of data merge
# ex.1 rank US states and territories by their 2010 population density.
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data\state-areas.csv') #read_csv work with both '/' and '\'
abb = pd.read_csv('data/state-abbrevs.csv')
In [27]:
pop.head() 
Out[27]:
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
In [26]:
 abb.head()
Out[26]:
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
In [30]:
areas.head()
Out[30]:
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
In [49]:
# join pop with abb with outer
df1 = pd.merge(pop,abb, left_on='state/region', right_on='abbreviation',how='outer')
df1 = df1.drop('abbreviation',axis = 1)
df1
Out[49]:
state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama
... ... ... ... ... ...
2539 USA total 2010 309326295.0 NaN
2540 USA under18 2011 73902222.0 NaN
2541 USA total 2011 311582564.0 NaN
2542 USA under18 2012 73708179.0 NaN
2543 USA total 2012 313873685.0 NaN

2544 rows × 5 columns

In [50]:
# check NaN value for data cleaning
df1.isnull().any() # population and state have some null values
Out[50]:
state/region    False
ages            False
year            False
population       True
state            True
dtype: bool
In [66]:
# find what is Nan
df1[df1['population'].isnull()]  # state/region	is Puerto Rico prior to the year 2000 (old data)
Out[66]:
state/region ages year population state
2448 PR under18 1990 NaN NaN
2449 PR total 1990 NaN NaN
2450 PR total 1991 NaN NaN
2451 PR under18 1991 NaN NaN
2452 PR total 1993 NaN NaN
2453 PR under18 1993 NaN NaN
2454 PR under18 1992 NaN NaN
2455 PR total 1992 NaN NaN
2456 PR under18 1994 NaN NaN
2457 PR total 1994 NaN NaN
2458 PR total 1995 NaN NaN
2459 PR under18 1995 NaN NaN
2460 PR under18 1996 NaN NaN
2461 PR total 1996 NaN NaN
2462 PR under18 1998 NaN NaN
2463 PR total 1998 NaN NaN
2464 PR total 1997 NaN NaN
2465 PR under18 1997 NaN NaN
2466 PR total 1999 NaN NaN
2467 PR under18 1999 NaN NaN
In [80]:
# return state/region as series that has 'state' value null and .unique for get only unique value
df1.loc[df1['state'].isnull()]['state/region'].unique()
Out[80]:
array(['PR', 'USA'], dtype=object)
In [99]:
# fill the 'state' column of 'PR' and 'USA'
df1.loc[df1['state/region']== 'PR','state'] = 'Puerto Rico'
df1.loc[df1['state/region']== 'USA','state'] = 'United States'
df1
Out[99]:
state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama
... ... ... ... ... ...
2539 USA total 2010 309326295.0 United States
2540 USA under18 2011 73902222.0 United States
2541 USA total 2011 311582564.0 United States
2542 USA under18 2012 73708179.0 United States
2543 USA total 2012 313873685.0 United States

2544 rows × 5 columns

In [113]:
# check again state is okay, since we dont have population information we can finish with data cleaning
df1.isnull().any()
Out[113]:
state/region    False
ages            False
year            False
population       True
state           False
dtype: bool
In [117]:
# go on and merge with areas
df2 =pd.merge(df1,areas,on='state',how='left')
df2
Out[117]:
state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0
... ... ... ... ... ... ...
2539 USA total 2010 309326295.0 United States NaN
2540 USA under18 2011 73902222.0 United States NaN
2541 USA total 2011 311582564.0 United States NaN
2542 USA under18 2012 73708179.0 United States NaN
2543 USA total 2012 313873685.0 United States NaN

2544 rows × 6 columns

In [118]:
# check again
df2.isnull().any() # some area is null
Out[118]:
state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool
In [123]:
# get unique state that has null value of 'area (sq. mi)'
df2.loc[df2['area (sq. mi)'].isnull(),'state'].unique()
Out[123]:
array(['United States'], dtype=object)
In [129]:
# jsut drop united states
df2.dropna(inplace=True)
df2
Out[129]:
state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0
... ... ... ... ... ... ...
2491 PR under18 2010 896945.0 Puerto Rico 3515.0
2492 PR under18 2011 869327.0 Puerto Rico 3515.0
2493 PR total 2011 3686580.0 Puerto Rico 3515.0
2494 PR under18 2012 841740.0 Puerto Rico 3515.0
2495 PR total 2012 3651545.0 Puerto Rico 3515.0

2476 rows × 6 columns

In [136]:
# ex.1 rank US states and territories by their 2010 population density.
# step 1 filter only year 2010 (could use package numexpr with query() for faster query())
df3 = df2[df2['ages']=='total']
df4 = df3[df3['year']==2010 ]
df4 # this is all data we need to answer ex.1
Out[136]:
state/region ages year population state area (sq. mi)
3 AL total 2010 4785570.0 Alabama 52423.0
91 AK total 2010 713868.0 Alaska 656425.0
101 AZ total 2010 6408790.0 Arizona 114006.0
189 AR total 2010 2922280.0 Arkansas 53182.0
197 CA total 2010 37333601.0 California 163707.0
283 CO total 2010 5048196.0 Colorado 104100.0
293 CT total 2010 3579210.0 Connecticut 5544.0
379 DE total 2010 899711.0 Delaware 1954.0
389 DC total 2010 605125.0 District of Columbia 68.0
475 FL total 2010 18846054.0 Florida 65758.0
485 GA total 2010 9713248.0 Georgia 59441.0
570 HI total 2010 1363731.0 Hawaii 10932.0
581 ID total 2010 1570718.0 Idaho 83574.0
666 IL total 2010 12839695.0 Illinois 57918.0
677 IN total 2010 6489965.0 Indiana 36420.0
762 IA total 2010 3050314.0 Iowa 56276.0
773 KS total 2010 2858910.0 Kansas 82282.0
858 KY total 2010 4347698.0 Kentucky 40411.0
869 LA total 2010 4545392.0 Louisiana 51843.0
954 ME total 2010 1327366.0 Maine 35387.0
965 MD total 2010 5787193.0 Maryland 12407.0
1050 MA total 2010 6563263.0 Massachusetts 10555.0
1061 MI total 2010 9876149.0 Michigan 96810.0
1146 MN total 2010 5310337.0 Minnesota 86943.0
1157 MS total 2010 2970047.0 Mississippi 48434.0
1242 MO total 2010 5996063.0 Missouri 69709.0
1253 MT total 2010 990527.0 Montana 147046.0
1338 NE total 2010 1829838.0 Nebraska 77358.0
1349 NV total 2010 2703230.0 Nevada 110567.0
1434 NH total 2010 1316614.0 New Hampshire 9351.0
1445 NJ total 2010 8802707.0 New Jersey 8722.0
1530 NM total 2010 2064982.0 New Mexico 121593.0
1541 NY total 2010 19398228.0 New York 54475.0
1626 NC total 2010 9559533.0 North Carolina 53821.0
1637 ND total 2010 674344.0 North Dakota 70704.0
1722 OH total 2010 11545435.0 Ohio 44828.0
1733 OK total 2010 3759263.0 Oklahoma 69903.0
1818 OR total 2010 3837208.0 Oregon 98386.0
1829 PA total 2010 12710472.0 Pennsylvania 46058.0
1914 RI total 2010 1052669.0 Rhode Island 1545.0
1925 SC total 2010 4636361.0 South Carolina 32007.0
2010 SD total 2010 816211.0 South Dakota 77121.0
2021 TN total 2010 6356683.0 Tennessee 42146.0
2106 TX total 2010 25245178.0 Texas 268601.0
2117 UT total 2010 2774424.0 Utah 84904.0
2202 VT total 2010 625793.0 Vermont 9615.0
2213 VA total 2010 8024417.0 Virginia 42769.0
2298 WA total 2010 6742256.0 Washington 71303.0
2309 WV total 2010 1854146.0 West Virginia 24231.0
2394 WI total 2010 5689060.0 Wisconsin 65503.0
2405 WY total 2010 564222.0 Wyoming 97818.0
2490 PR total 2010 3721208.0 Puerto Rico 3515.0
In [140]:
# find that answer 
df4['pop_density'] = df4['population'] / df4['area (sq. mi)']
df4.head()
Out[140]:
state/region ages year population state area (sq. mi) pop_density
3 AL total 2010 4785570.0 Alabama 52423.0 91.287603
91 AK total 2010 713868.0 Alaska 656425.0 1.087509
101 AZ total 2010 6408790.0 Arizona 114006.0 56.214497
189 AR total 2010 2922280.0 Arkansas 53182.0 54.948667
197 CA total 2010 37333601.0 California 163707.0 228.051342
In [175]:
# get only column state and pop_density for answer
df5 = df4[['state','pop_density']]
# change index to state
df5.set_index('state',inplace = True)
#sort by desc
df5.sort_values('pop_density',ascending= False).head()
Out[175]:
pop_density
state
District of Columbia 8898.897059
Puerto Rico 1058.665149
New Jersey 1009.253268
Rhode Island 681.339159
Connecticut 645.600649

Aggregation and Grouping

In [44]:
#import planet data set from seaborn
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
Out[44]:
(1035, 6)
In [45]:
planets.head()
Out[45]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009

Simple Aggregation in Pandas

In [15]:
df = pd.DataFrame({ 'A' : [1,2,3],
                    'B' : [4,5,6]})
print(df.sum())
df
A     6
B    15
dtype: int64
Out[15]:
A B
0 1 4
1 2 5
2 3 6
In [16]:
# or aggregate along row with axis = 'columns'
df.sum(axis = 'columns')
Out[16]:
0    5
1    7
2    9
dtype: int64
In [19]:
# summary statistic with describe()
planets.dropna().describe()
Out[19]:
number orbital_period mass distance year
count 498.00000 498.000000 498.000000 498.000000 498.000000
mean 1.73494 835.778671 2.509320 52.068213 2007.377510
std 1.17572 1469.128259 3.636274 46.596041 4.167284
min 1.00000 1.328300 0.003600 1.350000 1989.000000
25% 1.00000 38.272250 0.212500 24.497500 2005.000000
50% 1.00000 357.000000 1.245000 39.940000 2009.000000
75% 2.00000 999.600000 2.867500 59.332500 2011.000000
max 6.00000 17337.500000 25.000000 354.000000 2014.000000

Listing of Pandas aggregation methods

Aggregation Description
count() Total number of items
first(), last() First and last item
mean(), median() Mean and median
min(), max() Minimum and maximum
std(), var() Standard deviation and variance
mad() Mean absolute deviation
prod() Product of all items
sum() Sum of all items

GroupBy: Split, Apply, Combine

In [21]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                  'data': range(6)}, 
                  columns=['key', 'data'])
df
Out[21]:
key data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
In [23]:
# using groupby
df.groupby('key') # what is returned is not a set of DataFrames, but a DataFrameGroupByobject. 
# it is a special view of the DataFrame, which is poised to dig into the groups but does no actual computation until the aggregation is applied
Out[23]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002C79378F208>
In [24]:
# to produce a result, we add the aggregate function
df.groupby('key').sum()
Out[24]:
data
key
A 3
B 5
C 7
In [26]:
# object can be index with [] same as DataFrame
planets.groupby('method')['orbital_period'].mean()
Out[26]:
method
Astrometry                          631.180000
Eclipse Timing Variations          4751.644444
Imaging                          118247.737500
Microlensing                       3153.571429
Orbital Brightness Modulation         0.709307
Pulsar Timing                      7343.021201
Pulsation Timing Variations        1170.000000
Radial Velocity                     823.354680
Transit                              21.102073
Transit Timing Variations            79.783500
Name: orbital_period, dtype: float64
In [28]:
# also can be interated
for (method,group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape)) # this can also be done with apply()
Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)
In [34]:
# using group by with describe 
planets.groupby('method')['orbital_period'].describe()
Out[34]:
count mean std min 25% 50% 75% max
method
Astrometry 2.0 631.180000 544.217663 246.360000 438.770000 631.180000 823.590000 1016.000000
Eclipse Timing Variations 9.0 4751.644444 2499.130945 1916.250000 2900.000000 4343.500000 5767.000000 10220.000000
Imaging 12.0 118247.737500 213978.177277 4639.150000 8343.900000 27500.000000 94250.000000 730000.000000
Microlensing 7.0 3153.571429 1113.166333 1825.000000 2375.000000 3300.000000 3550.000000 5100.000000
Orbital Brightness Modulation 3.0 0.709307 0.725493 0.240104 0.291496 0.342887 0.943908 1.544929
Pulsar Timing 5.0 7343.021201 16313.265573 0.090706 25.262000 66.541900 98.211400 36525.000000
Pulsation Timing Variations 1.0 1170.000000 NaN 1170.000000 1170.000000 1170.000000 1170.000000 1170.000000
Radial Velocity 553.0 823.354680 1454.926210 0.736540 38.021000 360.200000 982.000000 17337.500000
Transit 397.0 21.102073 46.185893 0.355000 3.160630 5.714932 16.145700 331.600590
Transit Timing Variations 3.0 79.783500 71.599884 22.339500 39.675250 57.011000 108.505500 160.000000

Aggregate, filter, transform, apply

GroupBy objects have aggregate(), filter(), transform(), and apply()

In [5]:
df = pd.DataFrame({   'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': range(6),
                    'data2': np.random.randint(0, 10, 6)},
                    columns = ['key', 'data1', 'data2'])
df
Out[5]:
key data1 data2
0 A 0 5
1 B 1 8
2 C 2 0
3 A 3 2
4 B 4 6
5 C 5 9
In [51]:
# aggregate()
df.groupby('key').aggregate(['first', np.argmin	, 'prod']) # use with aggregate function to specify directly 
C:\Users\PK\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:61: FutureWarning: 
The current behaviour of 'Series.argmin' is deprecated, use 'idxmin'
instead.
The behavior of 'argmin' will be corrected to return the positional
minimum in the future. For now, use 'series.values.argmin' or
'np.argmin(np.array(values))' to get the position of the minimum
row.
  return bound(*args, **kwds)
Out[51]:
data1 data2
first argmin prod first argmin prod
key
A 0 0 0 9 3 36
B 1 1 4 9 4 54
C 2 2 10 5 5 15
In [55]:
# also can customize for each column
df.groupby('key').aggregate({'data1' : ['min','max'], 'data2' : 'max'})
Out[55]:
data1 data2
min max max
key
A 0 3 9
B 1 4 9
C 2 5 5
In [74]:
#filtering() drop data based on the group properties.
def filter_func(x):
    return x['data2'].std() > 2

print(df); print(df.groupby('key').std());
  key  data1  data2
0   A      0      9
1   B      1      9
2   C      2      5
3   A      3      4
4   B      4      6
5   C      5      3
       data1     data2
key                   
A    2.12132  3.535534
B    2.12132  2.121320
C    2.12132  1.414214
In [78]:
# for only A and B for data2 have more than 2 std
df.groupby('key').filter(filter_func) # return original table with filter 
Out[78]:
key data1 data2
0 A 0 9
1 B 1 9
3 A 3 4
4 B 4 6
In [83]:
# Transformation .transform() 
# While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine
df.groupby('key').transform(lambda x: x - x.mean())
Out[83]:
data1 data2
0 -1.5 2.5
1 -1.5 1.5
2 -1.5 1.0
3 1.5 -2.5
4 1.5 -1.5
5 1.5 -1.0
In [6]:
# apply()  .apply() 
# lets you apply an arbitrary function to the group results. The function take a DataFrame, and return either a Pandas object or a scalar;
#sample normalize column by Z-score
def normalize(x):
    ''' x is a DataFrame'''
    x['data1'] = (x['data1'] - x['data1'].mean()) / x['data1'].std() # find Z score
    x['data2'] = (x['data2'] - x['data2'].mean()) / x['data2'].std()
    return x

df.groupby('key').apply(normalize)
Out[6]:
key data1 data2
0 A -0.707107 0.707107
1 B -0.707107 0.707107
2 C -0.707107 0.707107
3 A 0.707107 -0.707107
4 B 0.707107 -0.707107
5 C 0.707107 -0.707107

Specifying the split key

The key can be any series or list with a length matching that of the DataFrame

In [32]:
L = [0, 1, 0, 1, 2, 0] # specify column to groupby 
df
Out[32]:
key data1 data2
0 A 0 5
1 B 1 8
2 C 2 0
3 A 3 2
4 B 4 6
5 C 5 9
In [33]:
# group by array
df.groupby(L).sum() 
# this one means sum first, third and sixth column to 0th row. sum second and forth column to 1st row. sum fifth column to 2nd row
Out[33]:
data1 data2
0 7 14
1 4 10
2 4 6
In [37]:
print(df['key'])
df.groupby(df['key']).sum() # great way accumulate for keys
0    A
1    B
2    C
3    A
4    B
5    C
Name: key, dtype: object
Out[37]:
data1 data2
key
A 3 7
B 5 14
C 7 9
In [38]:
# A dictionary or series mapping index to group.
df2 = df.set_index('key')
mapping = {'A': 'vowel', 
           'B': 'consonant', 
           'C': 'consonant'}

print(df2); print(df2.groupby(mapping).sum()) #aggregate for keys
     data1  data2
key              
A        0      5
B        1      8
C        2      0
A        3      2
B        4      6
C        5      9
           data1  data2
consonant     12     23
vowel          3      7
In [41]:
# Any Python function. (lower case)
df2.groupby(str.lower).mean()
Out[41]:
data1 data2
a 1.5 3.5
b 2.5 7.0
c 3.5 4.5
In [43]:
# can also use multiIndex
df2.groupby([mapping,str.lower]).mean()
Out[43]:
data1 data2
consonant b 2.5 7.0
c 3.5 4.5
vowel a 1.5 3.5
In [46]:
# Grouping example : count discovered planets by method and by decade
planets.head()
Out[46]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
In [80]:
# create decade per row
decade = (planets['year']//10)*10
decade = decade.astype(str)+'s'
decade.name = 'decade'
# group multiIndex by method and decade
planets.groupby(['method',decade])['number'].sum()
Out[80]:
method                         decade
Astrometry                     2010s       2
Eclipse Timing Variations      2000s       5
                               2010s      10
Imaging                        2000s      29
                               2010s      21
Microlensing                   2000s      12
                               2010s      15
Orbital Brightness Modulation  2010s       5
Pulsar Timing                  1990s       9
                               2000s       1
                               2010s       1
Pulsation Timing Variations    2000s       1
Radial Velocity                1980s       1
                               1990s      52
                               2000s     475
                               2010s     424
Transit                        2000s      64
                               2010s     712
Transit Timing Variations      2010s       9
Name: number, dtype: int64
In [78]:
planets.groupby(['method',decade])['number'].sum().unstack().fillna(0)
Out[78]:
decade 1980s 1990s 2000s 2010s
method
Astrometry 0.0 0.0 0.0 2.0
Eclipse Timing Variations 0.0 0.0 5.0 10.0
Imaging 0.0 0.0 29.0 21.0
Microlensing 0.0 0.0 12.0 15.0
Orbital Brightness Modulation 0.0 0.0 0.0 5.0
Pulsar Timing 0.0 9.0 1.0 1.0
Pulsation Timing Variations 0.0 0.0 1.0 0.0
Radial Velocity 1.0 52.0 475.0 424.0
Transit 0.0 0.0 64.0 712.0
Transit Timing Variations 0.0 0.0 0.0 9.0

Pivot Tables

think of pivot tables as essentially a multidimensional version of GroupBy aggregation

In [3]:
# use example data from sns
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
Out[3]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [6]:
# find survival rate by sex  and class with groupby
titanic.groupby(['sex','class'])[['survived']].mean().unstack()
Out[6]:
survived
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
In [14]:
# with multidimension better use Pivot table
titanic.pivot_table('survived',index='sex', columns='class') # more read-able than groupby
Out[14]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447

Multilevel pivot tables

grouping in pivot tables can be specified with multiple levels interested in looking at age as a third dimension.

In [17]:
# create age range per row and use as index
age = pd.cut(titanic['age'], [0,18,80])# create range from 0-18 and 18-80
titanic.pivot_table('survived', index=['sex', age], columns='class')
Out[17]:
class First Second Third
sex age
female (0, 18] 0.909091 1.000000 0.511628
(18, 80] 0.972973 0.900000 0.423729
male (0, 18] 0.800000 0.600000 0.215686
(18, 80] 0.375000 0.071429 0.133663
In [30]:
# range for column
fare = pd.qcut(titanic['fare'], 2) # create auto quantile 2 range 
titanic.pivot_table('survived', index=['sex',age], columns=[fare,'class'])
Out[30]:
fare (-0.001, 14.454] (14.454, 512.329]
class First Second Third First Second Third
sex age
female (0, 18] NaN 1.000000 0.714286 0.909091 1.000000 0.318182
(18, 80] NaN 0.880000 0.444444 0.972973 0.914286 0.391304
male (0, 18] NaN 0.000000 0.260870 0.800000 0.818182 0.178571
(18, 80] 0.0 0.098039 0.125000 0.391304 0.030303 0.192308
In [32]:
# margin for ALL
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
Out[32]:
class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
In [40]:
# sample data with Birth date
births = pd.read_csv('data/births.csv')
births.head()
Out[40]:
year month day gender births
0 1969 1 1.0 F 4046
1 1969 1 1.0 M 4440
2 1969 1 2.0 F 4454
3 1969 1 2.0 M 4548
4 1969 1 3.0 F 4548
In [48]:
decade = (births['year'] //10) *10 # create decade range by 10 years
births.pivot_table('births', index=decade, columns='gender')
Out[48]:
gender F M
year
1960 4566.755208 4808.781250
1970 4267.403569 4497.386393
1980 5460.886072 5740.886635
1990 162328.783333 170171.275000
2000 168789.898148 176911.370370
In [49]:
%matplotlib inline
import matplotlib.pyplot as plt
sns.set() # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');

Vectorized String Operations

Pandas string operations, clean up string with
$.str$

In [78]:
names = pd.Series(['peter','Paul',None,'MARY','Party','gUIDO'])
names
Out[78]:
0    peter
1     Paul
2     None
3     MARY
4    Party
5    gUIDO
dtype: object
In [79]:
# Capitalize all elements while skipping missing values
names.str.capitalize()
Out[79]:
0    Peter
1     Paul
2     None
3     Mary
4    Party
5    Guido
dtype: object
In [68]:
'''
All Python String method
len() lower() translate() islower()
ljust() upper() startswith() isupper()
rjust() find() endswith() isnumeric()
center() rfind() isalnum() isdecimal()
zfill() index() isalpha() split()
strip() rindex() isdigit() rsplit()
rstrip() capitalize() isspace() partition()
lstrip() swapcase() istitle() rpartition()
'''
Out[68]:
'\nAll Python String method\nlen() lower() translate() islower()\nljust() upper() startswith() isupper()\nrjust() find() endswith() isnumeric()\ncenter() rfind() isalnum() isdecimal()\nzfill() index() isalpha() split()\nstrip() rindex() isdigit() rsplit()\nrstrip() capitalize() isspace() partition()\nlstrip() swapcase() istitle() rpartition()\n'
In [87]:
# ex. return names start with P
names[names.str.startswith('P').fillna(False)] # see that peter is in lower case so not return
Out[87]:
1     Paul
4    Party
dtype: object

Methods using regular expressions

Mapping between Pandas methods and functions in Python’s re module

Method Description
match() Call re.match() on each element, returning a Boolean.
extract() Call re.match() on each element, returning matched groups as strings
findall() Call re.findall() on each element.
replace() Replace occurrences of pattern with some other string.
contains() Call re.search() on each element, returning a Boolean.
count() Count occurrences of pattern.
split() Equivalent to str.split(), but accepts regexps.
rsplit() Equivalent to str.rsplit(), but accepts regexps.

Other Pandas string methods

Method Description
get() Index each element
slice() Slice each element
slice_replace() Replace slice in each element with passed value
cat() Concatenate strings
repeat() Repeat values
normalize() Return Unicode form of string
pad() Add whitespace to left, right, or both sides of strings
wrap() Split long strings into lines with length less than a given width
join() Join strings in each element of the Series with passed separator
get_dummies() Extract dummy variables as a DataFrame
In [89]:
names.str[0:3] # get slice of first three character
Out[89]:
0     pet
1     Pau
2    None
3     MAR
4     Par
5     gUI
dtype: object
In [96]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam','Eric Idle', 'Terry Jones', 'Michael Palin'])
# get last name
print(monte)
monte.str.split().str.get(-1) # split give list of twe str and we get the last one with get
0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object
Out[96]:
0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
In [98]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C','B|C|D']})
full_monte
Out[98]:
name info
0 Graham Chapman B|C|D
1 John Cleese B|D
2 Terry Gilliam A|C
3 Eric Idle B|D
4 Terry Jones B|C
5 Michael Palin B|C|D
In [103]:
#get dummy variables
full_monte['info'].str.get_dummies('|')
Out[103]:
A B C D
0 0 1 1 1
1 0 1 0 1
2 1 0 1 0
3 0 1 0 1
4 0 1 1 0
5 0 1 1 1

Working with Time Series (dates, times, and timeindexed data.)

In [110]:
# numpy datetime64 object
date = np.array('2015-07-04', dtype=np.datetime64)
date
Out[110]:
array('2015-07-04', dtype='datetime64[D]')
In [111]:
# then can do vectorize operation
date + np.arange(12)
Out[111]:
array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
       '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
       '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
      dtype='datetime64[D]')
In [112]:
# date time in panda 
date = pd.to_datetime('4th of july, 2016')
Out[112]:
Timestamp('2016-07-04 00:00:00')
In [121]:
print(date.strftime('%D')) #return date
print(date.strftime('%Y')) #return month
print(date.strftime('%d')) #return day
07/04/16
2016
04
In [122]:
date + pd.to_timedelta(np.arange(12), 'D') #add 12 day
Out[122]:
DatetimeIndex(['2016-07-04', '2016-07-05', '2016-07-06', '2016-07-07',
               '2016-07-08', '2016-07-09', '2016-07-10', '2016-07-11',
               '2016-07-12', '2016-07-13', '2016-07-14', '2016-07-15'],
              dtype='datetime64[ns]', freq=None)

Pandas Time Series: Indexing by Time

In [128]:
dates = pd.to_datetime([ '4th of July, 2015','2015-Jul-6', '07-07-2015', '20150708'])
dates - dates[0] # find time different 
Out[128]:
TimedeltaIndex(['0 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)

Regular sequences: pd.date_range()

return date return
$pd.date\_range()$ for timestamps,
$pd.period\_range()$ for periods,
$pd.timedelta\_range()$ for time deltas.

In [134]:
print(pd.date_range('2015-07-03','2015-07-10')) # return range (start,stop)
print(pd.date_range('2015-07-03',periods = 8,freq='h')) # return range (start,how many day)
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
               '2015-07-03 02:00:00', '2015-07-03 03:00:00',
               '2015-07-03 04:00:00', '2015-07-03 05:00:00',
               '2015-07-03 06:00:00', '2015-07-03 07:00:00'],
              dtype='datetime64[ns]', freq='H')
In [148]:
pd.timedelta_range(0, periods=10, freq='ns') # return only hour
Out[148]:
TimedeltaIndex([       '00:00:00', '00:00:00.000000', '00:00:00.000000',
                '00:00:00.000000', '00:00:00.000000', '00:00:00.000000',
                '00:00:00.000000', '00:00:00.000000', '00:00:00.000000',
                '00:00:00.000000'],
               dtype='timedelta64[ns]', freq='N')
In [149]:
pd.timedelta_range(0, periods=9, freq="2H30T") # step by 2.30 hours
Out[149]:
TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00',
                '12:30:00', '15:00:00', '17:30:00', '20:00:00'],
               dtype='timedelta64[ns]', freq='150T')
In [151]:
# pd.tseries.offsets
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())
Out[151]:
DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-06',
               '2015-07-07'],
              dtype='datetime64[ns]', freq='B')

Resampling, Shifting, and Windowing

In [385]:
from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016',data_source='yahoo')
goog.head()
Out[385]:
High Low Open Close Volume Adj Close
Date
2004-08-19 51.835709 47.800831 49.813286 49.982655 44871300.0 49.982655
2004-08-20 54.336334 50.062355 50.316402 53.952770 22942800.0 53.952770
2004-08-23 56.528118 54.321388 55.168217 54.495735 18342800.0 54.495735
2004-08-24 55.591629 51.591621 55.412300 52.239193 15319700.0 52.239193
2004-08-25 53.798351 51.746044 52.284027 52.802086 9232100.0 52.802086
In [386]:
goog = goog['Close'] #interest only close price
goog.plot() 
Out[386]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ac98d3d408>
In [387]:
# resample() is aggregation datetime by hour, year , day, etc..
'''
All parameter 

B         business day frequency
C         custom business day frequency (experimental)
D         calendar day frequency
W         weekly frequency
M         month end frequency
SM        semi-month end frequency (15th and end of month)
BM        business month end frequency
CBM       custom business month end frequency
MS        month start frequency
SMS       semi-month start frequency (1st and 15th)
BMS       business month start frequency
CBMS      custom business month start frequency
Q         quarter end frequency
BQ        business quarter endfrequency
QS        quarter start frequency
BQS       business quarter start frequency
A         year end frequency
BA, BY    business year end frequency
AS, YS    year start frequency
BAS, BYS  business year start frequency
BH        business hour frequency
H         hourly frequency
T, min    minutely frequency
S         secondly frequency
L, ms     milliseconds
U, us     microseconds
N         nanoseconds
'''
goog.resample('W').mean() # aggregate data by week
Out[387]:
Date
2004-08-22     51.967712
2004-08-29     53.233467
2004-09-05     50.425990
2004-09-12     51.247910
2004-09-19     56.034968
                 ...    
2015-12-06    758.273987
2015-12-13    753.112000
2015-12-20    747.600012
2015-12-27    749.120010
2016-01-03    767.247498
Freq: W-SUN, Name: Close, Length: 594, dtype: float64
In [171]:
# resample() method, or the much simpler asfreq()
# difference between the two is that resample() is fundamentally a data aggregation, while asfreq() is fundamentally a data selection.
goog.plot(alpha= 0.5, style= '-') # set style of line
goog.resample('BA').mean().plot(style=':')
goog.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
loc='upper left')
# # Notice the difference: at each point, resample reports the average of the previous year,
# while asfreq reports the value at the end of the year.
Out[171]:
<matplotlib.legend.Legend at 0x240944e5388>

High-Performance Pandas: eval() and query()

pandas.eval() for Efficient Operations

The $.eval()$ function uses string expressions to efficiently compute operations using $DataFrame$ support
Arithmetic operators + - * /
Comparison operators > < >= !=
Bitwise operators | Object attributes and indices. $df.loc[ ]$, $df[ ]$

In [186]:
nrows,ncols =100000,100
df1,df2,df3,df4 = (pd.DataFrame(np.random.rand(nrows,ncols))
                    for i in range(4))
In [189]:
df2.head()
Out[189]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 0.401958 0.483594 0.074981 0.432266 0.751760 0.270394 0.534251 0.062514 0.202321 0.842009 ... 0.436479 0.870256 0.761619 0.289809 0.713318 0.445720 0.835122 0.543900 0.660631 0.345756
1 0.797005 0.369082 0.519192 0.534032 0.643448 0.403721 0.520731 0.076128 0.555917 0.100187 ... 0.695615 0.320592 0.852841 0.697409 0.206876 0.623295 0.099893 0.195659 0.125141 0.499562
2 0.132110 0.295043 0.660783 0.755754 0.136566 0.880015 0.275811 0.482909 0.576981 0.354257 ... 0.295926 0.811195 0.901325 0.391533 0.601865 0.493064 0.415287 0.369999 0.589187 0.435407
3 0.142260 0.812721 0.830352 0.239932 0.020341 0.833940 0.990551 0.826973 0.965190 0.617448 ... 0.891028 0.006948 0.888871 0.377546 0.441353 0.254979 0.376575 0.468171 0.997008 0.861989
4 0.117937 0.702203 0.890969 0.556356 0.269376 0.518024 0.089249 0.407258 0.175893 0.303551 ... 0.140420 0.918936 0.586080 0.125544 0.123225 0.213057 0.600994 0.558258 0.152704 0.751075

5 rows × 100 columns

In [190]:
#addition using typical sum
%timeit df1+df2+df3+df4
85.3 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [193]:
# using .eval() is faster
print(np.allclose(df1 + df2 + df3 + df4,pd.eval('df1 + df2 + df3 + df4'))) # check same thing?
%timeit pd.eval('df1 + df2 + df3 + df4')
True
43.6 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [194]:
#DataFrame.eval() for Column-Wise Operations
df = pd.DataFrame(np.random.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()
Out[194]:
A B C
0 0.734377 0.802887 0.883702
1 0.244489 0.469804 0.765987
2 0.660473 0.333190 0.352312
3 0.956315 0.255889 0.719265
4 0.977485 0.975037 0.491482
In [204]:
# treat column names as variable
result1 = df.A+df['B']/df.loc[:,'C']
result2 = df.eval('A+B/C')
(result2 == result1)
Out[204]:
0      True
1      True
2      True
3      True
4      True
       ... 
995    True
996    True
997    True
998    True
999    True
Length: 1000, dtype: bool
In [208]:
# use eval to create new column inplace
df.eval('D = (A+B)').head()
Out[208]:
A B C D
0 0.734377 0.802887 0.883702 1.537264
1 0.244489 0.469804 0.765987 0.714293
2 0.660473 0.333190 0.352312 0.993663
3 0.956315 0.255889 0.719265 1.212204
4 0.977485 0.975037 0.491482 1.952521
In [209]:
# also work with variable outside (locally)
column_mean = df.mean(1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean') # The @ character here marks a variable name rather than a column name,
np.allclose(result1, result2)
Out[209]:
True

DataFrame.query() Method

filtering operation

In [213]:
df[(df.A < 0.3) & (df.B < 0.3)]
pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
df.query('A<0.5 and B <0.5') # all this 3 is the same (easy to read)
Out[213]:
A B C
1 0.244489 0.469804 0.765987
8 0.148484 0.322732 0.507930
9 0.070821 0.378965 0.794246
12 0.041986 0.015745 0.020521
14 0.369384 0.032253 0.587584
... ... ... ...
980 0.155935 0.118136 0.625820
992 0.415436 0.083995 0.516727
993 0.016609 0.220334 0.887067
997 0.075169 0.209986 0.438690
998 0.252455 0.265873 0.706028

244 rows × 3 columns

Visualization with Matplotlib

multiplatform data visualization library built on NumPy arrays
One of Matplotlib’s most important features is its ability to play well with many operating systems and graphics backends.

In [215]:
import matplotlib as mpl
import matplotlib.pyplot as plt
# set style
plt.style.use('classic')
%matplotlib inline 

Using matplotlib in notebook (only to run once per kernal)

  • %matplotlib notebook ~ will lead to interactive plots embedded within the notebook
  • %matplotlib inline ~ will lead to static images of your plot embedded in the notebook
In [53]:
# sample
x = np.linspace(0,10,100)
fig = plt.figure()                
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--')
Out[53]:
[<matplotlib.lines.Line2D at 0x291fe91a708>]

Simple Line Plots

In [221]:
plt.style.use('seaborn-whitegrid') # set style
# For all Matplotlib plots, 
fig = plt.figure()      # 1. creating a figure that contains all the objects representing axes, graphics, text, and labels.
ax = plt.axes()         # 2. creating an axes contain the plot elements that make up our visualization
In [224]:
fig = plt.figure()
ax = plt.axes()
# use the ax.plot function to plot some data.
x = np.linspace(0,10,1000)
ax.plot(x,np.sin(x)); #; to end     # use the ax.plot function to plot some data
In [227]:
# or use pylab so that fig and axes get created in the backgroud
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));

Adjusting the Plot: Line, colors, styles and xlim, ylim

Line Colors and Styles

In [228]:
# color
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
In [229]:
# line style
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted

Adjusting the Plot: Axes Limits

plt.xlim( ) and plt.ylim( ) methods

In [234]:
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 5);
In [238]:
# The plt.axis() method can set the xand y limits with a single call, by passing a list [xmin, xmax, ymin,ymax]
plt.plot(x, np.sin(x))
plt.axis([-1,11,-1,2])
Out[238]:
[-1, 11, -1, 2]

Labeling Plots

In [54]:
plt.plot(x, np.sin(x), label= 'sin(x)') # label for legend
plt.plot(x, np.cos(x), ':b', label='cos(x)') #label for legend
plt.title("A Sin Curve") 
plt.xlabel("x") # lebel for x axes
plt.ylabel("sin(x)");
plt.legend(); #need for naming legend

Simple Scatter Plots

use character $'o'$ for scatter plot

In [58]:
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(np.sin(np.linspace(0, 10, 30)), 'o', color='black');
#The third argument in the function call is a character that represents the type of symbol used for the plotting. 
# Just as you can specify options such as '-' and '--'
In [59]:
# additional line to conect
plt.plot(x, y, '-ok'); # line (-), circle marker (o), black (k)
In [252]:
# Demonstration of point numbers
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
        label="marker='{0}'".format(marker))

Scatter Plots with plt.scatter

more powerful method of creating scatter plot
it can be used to create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data.

In [61]:
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
cmap='viridis')
plt.colorbar(); # show color scale
In [62]:
# test with iris data
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T
In [66]:
# #simultaneously explore four different dimensions of the data: the (x, y) location of each point corresponds to
# the sepal length and width, the size of the point is the petal width, and the color is the particular species of flower
plt.scatter(features[0],features[1],
           alpha = 0.2,
           s= 100* features[3],
           c=iris.target,
           cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);

Visualizing Errors

Basic Errorbars

plt.errorbar(x, y, yerr=dy, fmt='.k');

In [75]:
plt.style.use('seaborn-whitegrid')
#Visulize error (y) from point(x)
x = np.linspace(0,10,50)
dy =0.8
y= np.sin(x) +dy * np.random.randn(50)

plt.errorbar(x, y, yerr=dy, fmt='.k');
In [76]:
#make the errorbars lighter than the points themselves with ecolor='lightgray'
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
             ecolor='lightgray', elinewidth=3, capsize=0);

Histograms, Binnings, and Density

In [90]:
data = np.random.randn(3000)
plt.hist(data);
In [94]:
# The hist() function has many options to tune both the calculation and the display;
plt.hist(data,bins = 30,density=True, alpha = 0.5,
        histtype = 'stepfilled', color = 'steelblue',
        edgecolor= 'none');
In [96]:
#use histtype='stepfilled' and transparency alpha to compare between distribution
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)

plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
In [97]:
# count the number of points in a given bin and not display it, with np.histogram()
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
[  95  857 1477  530   41]

Two-Dimensional Histograms and Binnings

create histograms in two-dimensions by dividing points among two-dimensional bins.

In [98]:
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
In [104]:
# plt.hist2d function:
plt.hist2d(x,y,bins= 30, cmap ='Greens')
cb = plt.colorbar()
cb.set_label('coints in bin')
In [105]:
counts, xedges, yedges = np.histogram2d(x, y, bins=30)

Kernel density estimation

evaluating densities or a way to "smear out" the points in space and add up the result to obtain a smooth function with scipy.stats

In [106]:
from scipy.stats import gaussian_kde

# fit an array of size [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)

# evaluate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))

# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
           origin='lower', aspect='auto',
           extent=[-3.5, 3.5, -6, 6],
           cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")

Multiple Subplots

Subplots: groups of smaller axes that can exist together within a single figure.

In [125]:
# create plt.axes([left, bottom, width, height])
fig = plt.figure()

ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4],
                   xticklabels=[], ylim=(-1.2, 1.2)) # top axes

ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],
                   ylim=(-1.2, 1.2))   # bottom axes

x = np.linspace(0, 10)
ax1.plot(np.sin(x))
ax2.plot(np.cos(x));

Simple Grids of Subplots

plt.subplots() is the easier tool to use (note the s at the end of subplots). Rather than creating a single subplot, this function creates a full grid of subplots in a single line, returning them in a NumPy array.

Here we'll create a 2×3 grid of subplots, where all axes in the same row share their y-axis scale, and all axes in the same column share their x-axis scale:

In [135]:
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
In [136]:
# axes are in a two-dimensional array, indexed by [row, col]
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)),
                      fontsize=18, ha='center')
fig
Out[136]:

plt.GridSpec() object does not create a plot by itself; it is simply a convenient interface that is recognized by the plt.subplot() command. For example, a gridspec for a grid of two rows and three columns with some specified width and height space looks like this:

In [137]:
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
In [139]:
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2]);
In [140]:
# Create some normally distributed data
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 3000).T

# Set up the axes with gridspec
fig = plt.figure(figsize=(6, 6))
grid = plt.GridSpec(4, 4, hspace=0.2, wspace=0.2)
main_ax = fig.add_subplot(grid[:-1, 1:])
y_hist = fig.add_subplot(grid[:-1, 0], xticklabels=[], sharey=main_ax)
x_hist = fig.add_subplot(grid[-1, 1:], yticklabels=[], sharex=main_ax)

# scatter points on the main axes
main_ax.plot(x, y, 'ok', markersize=3, alpha=0.2)

# histogram on the attached axes
x_hist.hist(x, 40, histtype='stepfilled',
            orientation='vertical', color='gray')
x_hist.invert_yaxis()

y_hist.hist(y, 40, histtype='stepfilled',
            orientation='horizontal', color='gray')
y_hist.invert_xaxis()

Visualization with Seaborn

eaborn provides an API on top of Matplotlib that offers sane choices for plot style and color defaults and integrates with the functionality provided by Pandas DataFrames.

In [7]:
# just as these two lines
import seaborn as sns
sns.set()

Histograms, KDE, and densities

In [142]:
data = np.random.multivariate_normal([0,0], [[5,2],[2,2]], size=2000)
data = pd.DataFrame(data, columns= ['x','y'])
for col in 'xy':
    plt.hist(data[col], density=True, alpha = 0.5)
In [15]:
# get a smooth estimate of the distribution using a kernel density estimation
for col in 'xy':
    sns.kdeplot(data[col], shade=True)
In [16]:
# Histograms and KDE can be combined using distplot:
sns.distplot(data['x'])
sns.distplot(data['y']);
In [144]:
# pass two dimensional dataset
with sns.axes_style('white'):
    sns.jointplot("x", "y", data, kind='kde');

Pair plots

This is very useful for exploring correlations between multidimensional data,

In [145]:
# Pair plots
iris = sns.load_dataset("iris")
iris.head()
Out[145]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [147]:
sns.pairplot(iris, hue='species', height = 3);

Faceted histograms

In [148]:
tips = sns.load_dataset('tips')
tips.head()
Out[148]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [149]:
# shows the amount that restaurant staff receive in tips based on various indicator data:
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));

Bar plots : Time series

In [150]:
planets = sns.load_dataset('planets')
planets.head()
Out[150]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
In [159]:
with sns.axes_style('white'):
    g = sns.catplot("year", data=planets, aspect=2,
                       kind="count", color='steelblue')
    g.set_xticklabels(step=5)
In [160]:
# looking at method  and year and count
with sns.axes_style('white'):
    g = sns.factorplot("year", data=planets, aspect=4.0, kind='count',
                       hue='method', order=range(2001, 2015))
    g.set_ylabels('Number of Planets Discovered')

Example: Exploring Marathon Finishing Times

using Seaborn to help visualize and understand finishing results from a marathon.

In [161]:
# download the data
!curl -O https://raw.githubusercontent.com/jakevdp/marathon-data/master/marathon-data.csv
data = pd.read_csv('marathon-data.csv')
data.head()
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  4  836k    4 35646    0     0  18751      0  0:00:45  0:00:01  0:00:44 18751
 15  836k   15  132k    0     0  50607      0  0:00:16  0:00:02  0:00:14 50607
 21  836k   21  178k    0     0  50094      0  0:00:17  0:00:03  0:00:14 50094
 25  836k   25  213k    0     0  43811      0  0:00:19  0:00:04  0:00:15 43811
 29  836k   29  245k    0     0  43964      0  0:00:19  0:00:05  0:00:14 50374
 33  836k   33  277k    0     0  41054      0  0:00:20  0:00:06  0:00:14 49519
 33  836k   33  277k    0     0  34821      0  0:00:24  0:00:08  0:00:16 27084
 35  836k   35  293k    0     0  34100      0  0:00:25  0:00:08  0:00:17 22832
 35  836k   35  293k    0     0  29684      0  0:00:28  0:00:10  0:00:18 15965
 36  836k   36  309k    0     0  28340      0  0:00:30  0:00:11  0:00:19 11998
 38  836k   38  325k    0     0  27177      0  0:00:31  0:00:12  0:00:19  9206
 38  836k   38  325k    0     0  24665      0  0:00:34  0:00:13  0:00:21  9187
 38  836k   38  325k    0     0  24368      0  0:00:35  0:00:13  0:00:22  6738
 40  836k   40  341k    0     0  23644      0  0:00:36  0:00:14  0:00:22 10538
 42  836k   42  357k    0     0  23171      0  0:00:36  0:00:15  0:00:21 10652
 46  836k   46  389k    0     0  23884      0  0:00:35  0:00:16  0:00:19 14777
 48  836k   48  405k    0     0  23509      0  0:00:36  0:00:17  0:00:19 19744
 52  836k   52  437k    0     0  23814      0  0:00:35  0:00:18  0:00:17 22343
 54  836k   54  453k    0     0  23645      0  0:00:36  0:00:19  0:00:17 23647
 58  836k   58  485k    0     0  24066      0  0:00:35  0:00:20  0:00:15 26975
 59  836k   59  501k    0     0  23716      0  0:00:36  0:00:21  0:00:15 23145
 63  836k   63  533k    0     0  24102      0  0:00:35  0:00:22  0:00:13 26193
 65  836k   65  549k    0     0  23532      0  0:00:36  0:00:23  0:00:13 22492
 67  836k   67  565k    0     0  22897      0  0:00:37  0:00:25  0:00:12 20298
 69  836k   69  581k    0     0  23104      0  0:00:37  0:00:25  0:00:12 19218
 69  836k   69  581k    0     0  22259      0  0:00:38  0:00:26  0:00:12 16075
 71  836k   71  597k    0     0  22124      0  0:00:38  0:00:27  0:00:11 13141
 73  836k   73  613k    0     0  21673      0  0:00:39  0:00:28  0:00:11 12918
 75  836k   75  629k    0     0  21670      0  0:00:39  0:00:29  0:00:10 14707
 77  836k   77  645k    0     0  21568      0  0:00:39  0:00:30  0:00:09 13446
 80  836k   80  677k    0     0  21898      0  0:00:39  0:00:31  0:00:08 19935
 86  836k   86  725k    0     0  22718      0  0:00:37  0:00:32  0:00:05 25970
 90  836k   90  757k    0     0  22590      0  0:00:37  0:00:34  0:00:03 27561
 92  836k   92  773k    0     0  22782      0  0:00:37  0:00:34  0:00:03 29373
 92  836k   92  773k    0     0  22189      0  0:00:38  0:00:35  0:00:03 25965
 98  836k   98  821k    0     0  22909      0  0:00:37  0:00:36  0:00:01 29268
 98  836k   98  821k    0     0  21895      0  0:00:39  0:00:38  0:00:01 17195
100  836k  100  836k    0     0  22258      0  0:00:38  0:00:38 --:--:-- 19523
Out[161]:
age gender split final
0 33 M 01:05:38 02:08:51
1 32 M 01:06:26 02:09:28
2 31 M 01:06:49 02:10:42
3 38 M 01:06:16 02:13:45
4 31 M 01:06:32 02:13:59

By default, Pandas loaded the time columns as Python strings (type object);
We need to fix by import again with function to change to timedelta

In [172]:
pd.Timedelta(hours=1, minutes = 75,seconds=1)
Out[172]:
Timedelta('0 days 02:15:01')
In [173]:
def convert_time(s):
    h, m, s = map(int, s.split(':'))
    return pd.Timedelta(hours=h, minutes = m,seconds=s)

data = pd.read_csv('marathon-data.csv',
                   converters={'split':convert_time, 'final':convert_time})
data.dtypes
Out[173]:
age                 int64
gender             object
split     timedelta64[ns]
final     timedelta64[ns]
dtype: object
In [182]:
# For the purpose of ploting, let's next add columns that give the times in seconds:
data['split_sec'] = data['split']/np.timedelta64(1,'ms') / 1E3
data['final_sec'] = data['final']/np.timedelta64(1,'ms')/ 1E3
data.head()
# split time is time pass half way
Out[182]:
age gender split final split_sec final_sec
0 33 M 01:05:38 02:08:51 3938.0 7731.0
1 32 M 01:06:26 02:09:28 3986.0 7768.0
2 31 M 01:06:49 02:10:42 4009.0 7842.0
3 38 M 01:06:16 02:13:45 3976.0 8025.0
4 31 M 01:06:32 02:13:59 3992.0 8039.0
In [183]:
with sns.axes_style('white'):
    g = sns.jointplot("split_sec", "final_sec", data, kind='hex')
    g.ax_joint.plot(np.linspace(4000, 16000),
                    np.linspace(8000, 32000), ':k')

The dotted line shows where someone's time would lie if they ran the marathon using same time between first half and second half.
The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon (second half use more time). If you have run competitively, you'll know that those who do the opposite—run faster during the second half of the race—are said to have "negative-split" the race.

In [184]:
#create another column, the split fraction, which measures 
# each runner negative-splits or positive-splits the race:
data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']
data.head(3)
Out[184]:
age gender split final split_sec final_sec split_frac
0 33 M 01:05:38 02:08:51 3938.0 7731.0 -0.018756
1 32 M 01:06:26 02:09:28 3986.0 7768.0 -0.026262
2 31 M 01:06:49 02:10:42 4009.0 7842.0 -0.022443
In [185]:
# Let's see whether there is any correlation between this split fraction and other variables. We'll do this using a pairgrid,
g = sns.PairGrid(data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
                 hue='gender', palette='RdBu_r')
g.map(plt.scatter, alpha=0.8)
g.add_legend();
In [186]:
# The difference between men and women here is interesting. Let's look at the histogram of split fractions for these two groups:
sns.kdeplot(data.split_frac[data.gender=='M'], label='men', shade=True)
sns.kdeplot(data.split_frac[data.gender=='W'], label='women', shade=True)
plt.xlabel('split_frac');
# There are many more men than women who are running close to an even split! 
In [187]:
# Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly
g = sns.lmplot('final_sec', 'split_frac', col='gender', data=data,
               markers=".", scatter_kws=dict(color='c'))
g.map(plt.axhline, y=0.1, color="k", ls=":");

Machine Learning package in Python P331

Machine learning involves building statistical models to help understand data. “Learning” enters the fray when we give these models tunable parameters that can be adapted to observed data; in this way the program can be considered to be “learning” from the data.
Once these models have been fit to previously seen data, they can be used to predict and understand aspects of newly observed data.

Category

  • Supervised learning involves somehow modeling the relationship between measured features of data and some label associated with the data; once this model is determined, it can be used to apply labels to new, unknown data This is further subdivided into classification tasks and regression tasks:
  • Unsupervised learning involves modeling the features of a dataset without reference to any label, and is often described as “letting the dataset speak for itself.” These models include tasks such as clustering and dimensionality reduction.

Data Representation in Scikit-Learn

In [ ]:
# refer the rows of the matrix as samples, and the number of rows as n_samples.
# refer the columns of the matrix as features, and the number of columns as n_features.
In [14]:
iris = sns.load_dataset('iris')
iris.head() # species as a target array or outcomes
Out[14]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [7]:
sns.set()
sns.pairplot(iris, hue='species', size=1.5);
In [15]:
# extract features and outcome from data set as x and y
X_iris = iris.drop('species',axis='columns')
y_iris = iris['species']
X_iris.shape , y_iris.shape
Out[15]:
((150, 4), (150,))

Scikit-Learn’s Estimator API

Basics of the API the steps in using the Scikit-Learn estimator API are as follows

  1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
  2. Choose model hyperparameters by instantiating this class with desired values.
  3. Arrange data into a features matrix and target vector following the discussion from before.
  4. Fit the model to your data by calling the fit( ) method of the model instance.
  5. Apply the model to new data:

    • For supervised learning, often we predict labels for unknown data using the predict( ) method.
    • For unsupervised learning, we often transform or infer properties of the data using the transform( ) or predict( ) method.

In [23]:
# Sample with Linear regression
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);

1. Choose a class of model.

In [24]:
# in this case choose linear regression by import the linear regression class:
from sklearn.linear_model import LinearRegression

2.Choose Model hyperparameters

might need to answer one or more questions like the following:

• Would we like to fit for the offset (i.e., intercept)?
• Would we like the model to be normalized?
• Would we like to preprocess our features to add model flexibility?
• What degree of regularization would we like to use in our model?
• How many model components would we like to use?

These choices are often represented as hyperparameters, or parameters that must be set before the model is fit to data

In [25]:
# For linear regression, we can instantiate the LinearRegression class and 
# fit the intercept using the fit_intercept hyperparameter:
model = LinearRegression(fit_intercept=True)
model
# the only action here is the storing of these hyperparameter values
Out[25]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

3. Arrange data into a features matrix and target vector.

Y (outcome, target array) must be in form (n_samples, )
X (features, predictors) must be in form (n_samples, n_features)

In [26]:
# change x form
X = x[:,np.newaxis] # transpose
X.shape
Out[26]:
(50, 1)

4. Fit the model to your data.

This can be done with the fit( ) method

In [27]:
model.fit(X,y)
Out[27]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [28]:
# get data from fitted model -- model.<TAB>
model.coef_
Out[28]:
array([1.9776566])

5. Predict labels for unknown data.

using the predict( ) method

In [29]:
# create sample data fro unknown outcomes
xfit = np.linspace(-1, 11)
# change form
Xfit = xfit[:,np.newaxis]
# find outcome from new data
yfit = model.predict(Xfit)
In [17]:
# lets see the fitted line
plt.scatter(x, y)
plt.plot(xfit, yfit); # fitted line

Supervised learning example: Iris classification with Naive Bayes Classification

In [18]:
# 1. split data for training and validate with train_test_split
from sklearn.model_selection import train_test_split

Xtrain , Xtest, ytrain, ytest = train_test_split(X_iris, y_iris , random_state =1)
print(Xtrain.shape) ;print(Xtest.shape) ;print(ytrain.shape) ;print(ytest.shape) ;
(112, 4)
(38, 4)
(112,)
(38,)
In [30]:
# go through step
from sklearn.naive_bayes import GaussianNB  # 1. choose model class
model = GaussianNB()                        # 2. instantiate model
model.fit(Xtrain, ytrain)                   # 3. fit model to data
y_model = model.predict(Xtest)              # 4. predict on new data
In [69]:
# evaluate model with accuracy_score
from sklearn.metrics import accuracy_score
print('accuracy score of', accuracy_score(ytest,y_model))
accuracy score of 0.9736842105263158

Other evaluation and report method

In [32]:
# classification report evaluate , print(classification_report(Y_actual,Y_predict))
from sklearn.metrics import classification_report
print(classification_report(ytest,y_model))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       1.00      0.94      0.97        16
   virginica       0.90      1.00      0.95         9

    accuracy                           0.97        38
   macro avg       0.97      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38

In [33]:
# confusion matrix, pd.crosstab(Y_predict Y_actual, rownames=['Predicted'], colnames=['Actual'], margins=True)
pd.crosstab(y_model,ytest, rownames=['Predicted'], colnames=['Actual'], margins=True)
Out[33]:
Actual setosa versicolor virginica All
Predicted
setosa 13 0 0 13
versicolor 0 15 0 15
virginica 0 1 9 10
All 13 16 9 38
In [34]:
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(ytest, y_model)

sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
In [80]:
# useful summary report for regression model , regression_results(Y_actual,Y_predict)
import sklearn.metrics as metrics
def regression_results(y_true, y_pred):

    # Regression metrics
    explained_variance=metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error=metrics.mean_absolute_error(y_true, y_pred) 
    mse=metrics.mean_squared_error(y_true, y_pred) 
    mean_squared_log_error=metrics.mean_squared_log_error(y_true, y_pred)
    median_absolute_error=metrics.median_absolute_error(y_true, y_pred)
    r2=metrics.r2_score(y_true, y_pred)

    print('explained_variance: ', round(explained_variance,4))    
    print('mean_squared_log_error: ', round(mean_squared_log_error,4))
    print('r2: ', round(r2,4))
    print('MAE: ', round(mean_absolute_error,4))
    print('MSE: ', round(mse,4))
    print('RMSE: ', round(np.sqrt(mse),4))
In [ ]:
#report for intercept and coefficient
params = pd.Series(model.coef_, index=X.columns)
params
# estimate effect and error from bootstrap resample
from sklearn.utils import resample
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
              for i in range(1000)], 0)
# With these errors estimated, let's again look at the results:
print(pd.DataFrame({'effect': params.round(0),
                     'error': err.round(0)}))

Unsupervised learning example : Iris

reducing the dimensionality of the Iris data so as to more easily visualize it.
from 4 dimension we will ask PCA to return only 2 components

Ex1 .Iris reducing dimensionality with PCA

In [81]:
from sklearn.decomposition import PCA     # 1. Choose the model class
model = PCA(n_components=2)               # 2. Instantiate the model with hyperparameters
model.fit(X_iris)                         # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X_iris)            # 4. Transform the data to two dimensions
In [83]:
# let’s plot the results.
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot("PCA1", "PCA2", hue='species', data=iris, fit_reg=False);

Ex2 .Iris clustering with Gaussian mixture model (GMM)

In [85]:
from sklearn.mixture import GaussianMixture                    # 1. Choose the model class
model = GaussianMixture(n_components=3,covariance_type='full') # 2. Instantiate the model w/ hyperparameters
model.fit(X_iris)                                              # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X_iris)                                  # 4. Determine cluster labels
In [87]:
iris['cluster'] = y_gmm
sns.lmplot("PCA1", "PCA2", data=iris, hue='species',col='cluster', fit_reg=False);

Hyperparameters and Model Validation

Model Validation

In [222]:
# sample some data
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape
Out[222]:
((150, 4), (150,))

Holdout sets

using the train_test_split utility in Scikit-Learn to split data into training and validating
The holdout set is similar to unknown data, because the model has not “seen” it before.

In [257]:
from sklearn.model_selection import train_test_split
# split the data with 50% in each set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# fit the model on one set of data
model = GaussianNB() 
model.fit(X_train, y_train)
# evaluate the model on the second set of data
y2_fitted = model.predict(X_test)
accuracy_score(y_test, y2_fitted)
Out[257]:
0.9466666666666667

Cross Validation

One disadvantage of using a holdout set for model validation is that we have lost a portion of our data to the model training.
cross-validation is, to do a sequence of fits where each subset of the data is used both as a training set and as a validation set.

In [259]:
# Using the split data from before, we could implement it like this:
y2_model = model.fit(X_train, y_train).predict(X_test)
y1_model = model.fit(X_test, y_test).predict(X_train)
accuracy_score(y1, y1_model), accuracy_score(y_test, y2_fitted)
# we could average both to get a better measure of the global model performance.
Out[259]:
(0.9733333333333334, 0.9466666666666667)
In [260]:
# 5-fold cross validation,we can use Scikit-Learn’s cross_val_score convenience routine to do itsuccinctly:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5) # then take the mean score
Out[260]:
array([0.93333333, 0.96666667, 0.93333333, 0.93333333, 1.        ])

Selecting the Best Model

if our estimator is underperforming how should we move forward?

• Use a more complicated/more flexible model
• Use a less complicated/less flexible model
• Gather more training samples
• Gather more data to add features to each sample

Validation curves in Scikit-Learn : Consider Variance ~ Bias Variance trade-off model

In [264]:
# # Here we will use a polynomial regression model: this is a generalized
# linear model in which the degree of the polynomial is a tunable parameter
# ie, defree 1 is linear and A degree-3 polynomial fits a cubic curve to the data; y = ax^3 + bx^2 + cx + d
from sklearn.preprocessing import PolynomialFeatures #polynomial
from sklearn.linear_model import LinearRegression #linear regression
from sklearn.pipeline import make_pipeline #use a pipeline to string these operations together

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
    LinearRegression(**kwargs))
In [270]:
# create some data to which we will fit our model
def make_data(N, err=1.0, rseed=1):
    # randomly sample the data
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 10 - 1. / (X.ravel() + 0.1)
    if err > 0:
        y += err * rng.randn(N)
    return X, y
X, y = make_data(40)
In [271]:
X_test = np.linspace(-0.1, 1.1, 500)[:, None]
# visualize our data, along with polynomial fits of several degrees
plt.scatter(X.ravel(), y, color='black')
axis = plt.axis()
for degree in [1, 3, 5]:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test.ravel(), y_test, label='degree={0}'.format(degree))
plt.xlim(-0.1, 1.0)
plt.ylim(-2, 12)
plt.legend(loc='best');
# A useful question to answer is this: what  degree of polynomial provides a suitable 
# trade-off between bias (underfitting) and variance (overfitting)?
In [275]:
# decided by visualizing the validation curve for this particular
# data and model; we can do this straightforwardly using the validation_curve
from sklearn.model_selection import validation_curve
degree = np.arange(0, 21)
# Given a model, data, parameter name, and a range to explore, this function will automatically 
# compute both the training score and validation score across the range
train_score, val_score = validation_curve(PolynomialRegression(), X, y,'polynomialfeatures__degree',degree, cv=7)

plt.plot(degree, np.median(train_score, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');
# we can see that degree 3 gives the best training and validation score
In [282]:
# we can compute degree 3 and display this fit over the original data as follows
plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = PolynomialRegression(3).fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);

Learning Curves

In [283]:
# Try with bigger data
X2, y2 = make_data(200)
plt.scatter(X2.ravel(), y2);
In [285]:
# try validation curve for this larger dataset;
egree = np.arange(21)
train_score2, val_score2 = validation_curve(PolynomialRegression(), X2, y2,'polynomialfeatures__degree',degree, cv=7)
plt.plot(degree, np.median(train_score2, 1), color='blue',label='training score')
plt.plot(degree, np.median(val_score2, 1), color='red', label='validation score')
plt.plot(degree, np.median(train_score, 1), color='blue', alpha=0.3,linestyle='dashed')
plt.plot(degree, np.median(val_score, 1), color='red', alpha=0.3,linestyle='dashed')
plt.legend(loc='lower center')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');
#The solid lines show the new results, while the fainter dashed lines show the results of the previous smaller dataset.
# It is clear from the validation curve that the larger dataset can support a much more complicated model
# more data help reduce overfit for complex model because validatopn score keep converge to training
# but once you have enough points and a model has converged, adding more training data will not help you!

Learning curves in Scikit-Learn

In [287]:
from sklearn.model_selection import learning_curve
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for i, degree in enumerate([2, 9]):
    N, train_lc, val_lc = learning_curve(PolynomialRegression(degree),X, y, cv=7,train_sizes=np.linspace(0.3, 1, 25))
    ax[i].plot(N, np.mean(train_lc, 1), color='blue', label='training score')
    ax[i].plot(N, np.mean(val_lc, 1), color='red', label='validation score')
    ax[i].hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1], color='gray',linestyle='dashed')
    ax[i].set_ylim(0, 1)
    ax[i].set_xlim(N[0], N[-1])
    ax[i].set_xlabel('training size')
    ax[i].set_ylabel('score')
    ax[i].set_title('degree = {0}'.format(degree), size=14)
    ax[i].legend(loc='best')
#it gives us a visual depiction of how our model responds to increasing training data.

Validation with Grid Search : just get the best model

In muti-dimension cases, visualizations are difficult and we would rather simply find the particular model that maximizes the validation score.

In [290]:
from sklearn.model_selection import GridSearchCV

param_grid = {'polynomialfeatures__degree': np.arange(21),
         'linearregression__fit_intercept': [True, False],
             'linearregression__normalize': [True, False]}

grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
In [292]:
grid.fit(X, y);
In [293]:
grid.best_params_
Out[293]:
{'linearregression__fit_intercept': False,
 'linearregression__normalize': True,
 'polynomialfeatures__degree': 4}
In [296]:
model = grid.best_estimator_

plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = model.fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);

Feature Engineering

taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.

Categorical Features

In [299]:
# explore room information will have neriborhood as categorical
data = [{'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
        {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
        {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
        {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}]
In [302]:
# use one-hot encoding with dummy variables of 0 and 1
# with Scikit-Learn’s DictVectorizer
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
Out[302]:
array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])
In [303]:
# see the meaning of each column 
vec.get_feature_names()
Out[303]:
['neighborhood=Fremont',
 'neighborhood=Queen Anne',
 'neighborhood=Wallingford',
 'price',
 'rooms']

Text Features

One of the simplest methods of encoding data is by word counts: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.

In [310]:
sample = ['problem of evil',
          'evil queen',
          'horizon problem']
# construct a column representing each word “problem,” the word “evil,”. etc so on using Scikit-Learn’s CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
Out[310]:
evil horizon of problem queen
0 1 0 1 1 0
1 1 0 0 0 1
2 0 1 0 1 0
In [311]:
# better approach is term frequency–inverse document frequency (TF–IDF),
# which weights the word counts by a measure of how often they appear in the documents.
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
Out[311]:
evil horizon of problem queen
0 0.517856 0.000000 0.680919 0.517856 0.000000
1 0.605349 0.000000 0.000000 0.000000 0.795961
2 0.000000 0.795961 0.000000 0.605349 0.000000

Imputation of Missing Data

need to first replace such missing data with some appropriate fill value with Scikit-Learn Imputer

In [313]:
X = np.array([  [ np.nan, 0, 3 ],
                [ 3, 7, 9 ],
                [ 3, 5, 2 ],
                [ 4, np.nan, 6 ],
                [ 8, 8, 1 ]])
y = np.array([14, 16, -1, 8, -5])
Out[313]:
array([[nan,  0.,  3.],
       [ 3.,  7.,  9.],
       [ 3.,  5.,  2.],
       [ 4., nan,  6.],
       [ 8.,  8.,  1.]])
In [316]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2 #the two missing values have been replaced with the mean of the remaining values in the column.
Out[316]:
array([[4.5, 0. , 3. ],
       [3. , 7. , 9. ],
       [3. , 5. , 2. ],
       [4. , 5. , 6. ],
       [8. , 8. , 1. ]])

Feature Pipelines

  1. Impute missing values using the mean
  2. Transform features to quadratic
  3. Fit a linear regression

Scikit-Learn provides a pipeline object, which can be used as follows:

In [321]:
from sklearn.pipeline import make_pipeline

model = make_pipeline(SimpleImputer(strategy='mean'),PolynomialFeatures(degree=2),LinearRegression())
# This pipeline looks and acts like a standard Scikit-Learn object, and will apply all the specified steps to any input data
model.fit(X, y) # X with missing values, from above
print(y)
print(model.predict(X))
[14 16 -1  8 -5]
[14. 16. -1.  8. -5.]

Naive Bayes Classification

Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets.
Because they are so fast and have so few tunable parameters, they end up being very useful as a quick-anddirty baseline for a classification problem.

the assumption is that data from each label is drawn from a simple Gaussian distribution. (independent)
We can fit this model by simply finding the mean and standard deviation of the points within each labe

In [323]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');

Gaussian Naive Bayes

In [324]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y);
In [325]:
# generate some new data and predict the label:
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)
In [326]:
# see the data boundary
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);
In [327]:
# Compute probability  of first and second label
yprob = model.predict_proba(Xnew)
yprob[-8:].round(2)
Out[327]:
array([[0.89, 0.11],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [0.15, 0.85]])

Multinomial Naive Bayes

multinomial naive Bayes, where the features are assumed to be generated from a simple multinomial distribution.
It describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates.

In [35]:
# Example : Classify text
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
data.target_names
Out[35]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
In [36]:
# get train and test set
categories = ['talk.religion.misc', 'soc.religion.christian', 'sci.space','comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
In [37]:
# sample data in dict form
print(train.data[5])
From: dmcgee@uluhe.soest.hawaii.edu (Don McGee)
Subject: Federal Hearing
Originator: dmcgee@uluhe
Organization: School of Ocean and Earth Science and Technology
Distribution: usa
Lines: 10


Fact or rumor....?  Madalyn Murray O'Hare an atheist who eliminated the
use of the bible reading and prayer in public schools 15 years ago is now
going to appear before the FCC with a petition to stop the reading of the
Gospel on the airways of America.  And she is also campaigning to remove
Christmas programs, songs, etc from the public schools.  If it is true
then mail to Federal Communications Commission 1919 H Street Washington DC
20054 expressing your opposition to her request.  Reference Petition number

2493.

In [38]:
# we will use the TF–IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

model.fit(train.data, train.target)
labels = model.predict(test.data)
In [39]:
# evaluate with confusion_matrix
from sklearn.metrics import confusion_matrix


mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
In [404]:
# create function for predict whether this phrase is about?
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]
# Lets try predict
print(predict_category('sending a payload to the ISS'))
print(predict_category('determining the screen resolution'))
sci.space
comp.graphics

When to Use Naive Bayes

Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model. That said, they have several advantages:

  • They are extremely fast for both training and prediction
  • They provide straightforward probabilistic prediction
  • They are often very easily interpretable
  • They have very few (if any) tunable parameters

These advantages mean a naive Bayesian classifier is often a good choice as an initial baseline classification. If it does not perform well, then begin exploring more sophisticated models, with some baseline knowledge of how well they should perform.

Naive Bayes classifiers tend to perform especially well in one of the following situations:

  • When the naive assumptions actually match the data (very rare in practice)
  • For very well-separated categories, when model complexity is less important
  • For very high-dimensional data, when model complexity is less important

Naive Bayes tend to work as well or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.

Linear Regression

Simple Linear Regression

In [416]:
# sample 
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)
plt.scatter(x, y);
x = x[:,np.newaxis]
In [407]:
# use Scikit-Learn’s LinearRegression estimator to fit this data
from sklearn.linear_model import  LinearRegression

model = LinearRegression(fit_intercept=True)
# fit data
model.fit(x,y)
# create new data and find y_fitted 
xfit= np.linspace(0,10,1000)[:,np.newaxis]
yfit = model.predict(xfit)
# plot
plt.scatter(x,y)
plt.plot(xfit,yfit);
In [412]:
# print relevant parameter
print("Model slope: ", model.coef_[0])
print("Model intercept:", model.intercept_)
Model slope:  2.027208810360695
Model intercept: -4.998577085553202

Regularization

penalizing large values of the model parameters. Such a penalty is known as regularization, and comes in several forms.

  • Ridge regression (L2 regularization)

    • This proceeds by penalizing the sum of squares (2-norms) of the model coefficients
  • Lasso regularization (L1)

    • penalizing the sum of absolute values (1-norms) of regression coefficients
    • due to geometric reasons lasso regression tends to favor sparse models where possible; that is, it preferentially sets model coefficients to exactly zero
In [419]:
# Ridge L2
from sklearn.linear_model import Ridge

model = make_pipeline(GaussianFeatures(30), Ridge(alpha=0.1))
basis_plot(model, title='Ridge Regression')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-419-49a9e69e2a01> in <module>
      1 from sklearn.linear_model import Ridge
      2 
----> 3 model = make_pipeline(GaussianFeatures(30), Ridge(alpha=0.1))
      4 basis_plot(model, title='Ridge Regression')

NameError: name 'GaussianFeatures' is not defined
In [420]:
# L1
from sklearn.linear_model import Lasso
model = make_pipeline(GaussianFeatures(30), Lasso(alpha=0.001))
basis_plot(model, title='Lasso Regression')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-420-30db4247a916> in <module>
      1 # L1
      2 from sklearn.linear_model import Lasso
----> 3 model = make_pipeline(GaussianFeatures(30), Lasso(alpha=0.001))
      4 basis_plot(model, title='Lasso Regression')

NameError: name 'GaussianFeatures' is not defined

Example: Predicting Bicycle Traffic

predict the number of bicycle trips across Seattle’s Fremont Bridge based on weather, season, and other factors.

In [7]:
# start by load data
!curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD
    
counts = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
weather = pd.read_csv('data/BicycleWeather.csv', index_col='DATE', parse_dates=True)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 64700    0 64700    0     0  37205      0 --:--:--  0:00:01 --:--:-- 37205
100  735k    0  735k    0     0   268k      0 --:--:--  0:00:02 --:--:--  268k
100 1439k    0 1439k    0     0   385k      0 --:--:--  0:00:03 --:--:--  385k
100 1956k    0 1956k    0     0   433k      0 --:--:--  0:00:04 --:--:--  433k
In [80]:
# compute the total daily bicycle traffic
daily = counts.resample('d').sum()
daily['Total'] =  daily.sum(axis=1)
daily = daily[['Total']] # remove other columns
daily.head(3)
Out[80]:
Total
Date
2012-10-03 7042.0
2012-10-04 6950.0
2012-10-05 6296.0
In [81]:
# We saw that the patterns of use generally vary from day to day
sns.set_style("darkgrid")
sns.distplot(daily);
In [82]:
# let's account for this in our data by adding binary columns that indicate the day of the week:
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for i in range(7):
    daily[days[i]] = (daily.index.dayofweek == i).astype(float)
daily.head(3)
Out[82]:
Total Mon Tue Wed Thu Fri Sat Sun
Date
2012-10-03 7042.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2012-10-04 6950.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2012-10-05 6296.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
In [83]:
# Similarly, we might expect riders to behave differently on holidays; let's add an indicator of this as well:
from pandas.tseries.holiday import USFederalHolidayCalendar

cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016')
daily = daily.join(pd.Series(1, index=holidays, name='holiday'))
daily['holiday'].fillna(0, inplace=True)
daily.head(3)
Out[83]:
Total Mon Tue Wed Thu Fri Sat Sun holiday
Date
2012-10-03 7042.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2012-10-04 6950.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2012-10-05 6296.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
In [84]:
#We also might suspect that the hours of daylight would affect how many people ride; 
# let's use the standard astronomical calculation to add this information:
def hours_of_daylight(date, axis=23.44, latitude=47.61):
    """Compute the hours of daylight for the given date"""
    days = (date - pd.datetime(2000, 12, 21)).days
    m = (1. - np.tan(np.radians(latitude))
         * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
    return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.

daily['daylight_hrs'] = list(map(hours_of_daylight, daily.index))
daily[['daylight_hrs']].plot()
plt.ylim(8, 17)
Out[84]:
(8, 17)
In [85]:
#We can also add the average temperature and total precipitation to the data. 
# In addition to the inches of precipitation, let's add a flag that indicates whether a day is dry (has zero precipitation):
weather['Temp (C)'] = 0.5 * (weather['TMIN'] + weather['TMAX'])*10000 # average min and max temp * make it normal
# precip is in 1/10 mm; convert to inches
weather['PRCP'] /= 254 # precip
weather['dry day'] = (weather['PRCP'] == 0).astype(int)

daily = daily.join(weather[['PRCP', 'Temp (C)', 'dry day']])
daily.head(3)
Out[85]:
Total Mon Tue Wed Thu Fri Sat Sun holiday daylight_hrs PRCP Temp (C) dry day
Date
2012-10-03 7042.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 11.277359 0.0 13.35 1.0
2012-10-04 6950.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 11.219142 0.0 13.60 1.0
2012-10-05 6296.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 11.161038 0.0 15.30 1.0
In [86]:
# Let's add a counter that increases from day 1, and measures how many years have passed. 
# This will let us measure any observed annual increase or decrease in daily crossings:
daily['annual'] = (daily.index - daily.index[0]).days / 365.
daily.head()
Out[86]:
Total Mon Tue Wed Thu Fri Sat Sun holiday daylight_hrs PRCP Temp (C) dry day annual
Date
2012-10-03 7042.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 11.277359 0.0 13.35 1.0 0.000000
2012-10-04 6950.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 11.219142 0.0 13.60 1.0 0.002740
2012-10-05 6296.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 11.161038 0.0 15.30 1.0 0.005479
2012-10-06 4012.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 11.103056 0.0 15.85 1.0 0.008219
2012-10-07 4284.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 11.045208 0.0 15.85 1.0 0.010959
In [88]:
# Drop any rows with null values
# choose the columns to use, and fit a linear regression model to our data. We will set fit_intercept = False, 
# because the daily flags essentially operate as their own day-specific intercepts:
daily.dropna(axis=0, how='any', inplace=True)

column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday','daylight_hrs', 'PRCP', 'dry day', 'Temp (C)', 'annual']
X = daily[column_names]
y = daily['Total']
# fit model 
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
daily['predicted'] = model.predict(X)
In [89]:
# Finally, we can compare the total and predicted bicycle traffic visually:
daily[['Total', 'predicted']].plot(alpha=0.5);

It is evident that we have missed some key features, especially during the summer time.

Either our features are not complete (i.e., people decide whether to ride to work based on more than just these) or there are some nonlinear relationships that we have failed to take into account (e.g., perhaps people ride less at both high and low temperatures).

Nevertheless, our rough approximation is enough to give us some insights, and we can take a look at the coefficients of the linear model to estimate how much each feature contributes to the daily bicycle count:

In [91]:
params = pd.Series(model.coef_, index=X.columns)
params
Out[91]:
Mon             5.705256e+02
Tue             7.800426e+02
Wed             7.187231e+02
Thu             5.273547e+02
Fri            -9.224036e+01
Sat            -2.691453e+03
Sun            -2.747933e+03
holiday        -2.391003e+03
daylight_hrs    2.699949e+02
PRCP            2.050970e-13
dry day         1.460658e+03
Temp (C)        1.260226e+02
annual          5.172789e+01
dtype: float64
In [92]:
# These numbers are difficult to interpret without some measure of their uncertainty. 
# We can compute these uncertainties quickly using bootstrap resamplings of the data:
from sklearn.utils import resample
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
              for i in range(1000)], 0)
In [93]:
# With these errors estimated, let's again look at the results:
print(pd.DataFrame({'effect': params.round(0),
                     'error': err.round(0)}))
              effect  error
Mon            571.0  179.0
Tue            780.0  171.0
Wed            719.0  171.0
Thu            527.0  174.0
Fri            -92.0  168.0
Sat          -2691.0  164.0
Sun          -2748.0  171.0
holiday      -2391.0  332.0
daylight_hrs   270.0   19.0
PRCP             0.0    0.0
dry day       1461.0   62.0
Temp (C)       126.0    8.0
annual          52.0   37.0

We first see that there is a relatively stable trend in the weekly baseline: there are many more riders on weekdays than on weekends and holidays.
We see that for each additional hour of daylight, 270 ± 19 more people choose to ride; a temperature increase of one degree Celsius encourages 126 ± 8 people to grab their bicycle; a dry day means an average of 1461 ± 62 more riders, and each inch of precipitation means 0 ± 0 more people leave their bike at home.
Once all these effects are accounted for, we see a modest increase of 52 ± 37 new daily riders each year.

Our model is almost certainly missing some relevant information. For example, nonlinear effects (such as effects of precipitation and cold temperature) and nonlinear trends within each variable (such as disinclination to ride at very cold and very hot temperatures) cannot be accounted for in this model.
Additionally, we have thrown away some of the finer-grained information (such as the difference between a rainy morning and a rainy afternoon), and we have ignored correlations between days (such as the possible effect of a rainy Tuesday on Wednesday's numbers, or the effect of an unexpected sunny day after a streak of rainy days). These are all potentially interesting effects, and you now have the tools to begin exploring them if you wish!

Support Vector Machines

While Bayesian classification describing the distribution of each underlying class, and used these generative models to probabilistically determine labels for new points. That was an example of generative classification;

Here we will consider instead discriminative classification: rather than modeling each class, we simply find a line or curve (in two dimensions) or manifold (in multiple dimensions) that divides the classes from each other.

In [25]:
# sample with well distance 2 class of dataset
from scipy import stats
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
        random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');

Support Vector Machines: Maximizing the Margin

A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification.

In Support Vector Machines, we can draw around each line a margin of some width, up to the nearest point

In [26]:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA',alpha=0.4)
plt.xlim(-1, 3.5);

In support vector machines, the line that maximizes this margin is the one we will choose as the optimal model. Support vector machines are an example of such a maximum margin estimator.

Fitting a support vector machine

Use Scikit-Learn's support vector classifier to train an SVM model on this data. For the time being, we will use a linear kernel and set the C parameter to a very large number

In [7]:
from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)
Out[7]:
SVC(C=10000000000.0, break_ties=False, cache_size=200, class_weight=None,
    coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

To better visualize what's happening here, let's create a quick convenience function that will plot SVM decision boundaries for us:

In [21]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none');
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
In [27]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model);

This is the dividing line that maximizes the margin between the two sets of points. Notice that a few of the training points just touch the margin: they are indicated by the black circles in this figure. These points are the pivotal elements of this fit, and are known as the support vectors, and give the algorithm its name. In Scikit-Learn, the identity of these points are stored in the supportvectors attribute of the classifier:

In [11]:
model.support_vectors_
Out[11]:
array([[0.44359863, 3.11530945],
       [2.33812285, 3.43116792],
       [2.06156753, 1.96918596]])

Beyond linear boundaries: Kernel SVM , when working with no linear relationship

Where SVM becomes extremely powerful is when it is combined with kernels. such as polynomials in LinearRegression

In [35]:
# let's look at some data that is not linearly separable:
from sklearn.datasets import make_circles
X, y = make_circles(100, factor=.1, noise=.1)

clf = SVC(kernel='linear').fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf, plot_support=False);

Project the data into a higher dimension to compute a radial basis function centered on the middle clump:

In [57]:
r = np.exp(-(X ** 2).sum(1))
In [69]:
%matplotlib inline

from mpl_toolkits import mplot3d

def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn')
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')

interact(plot_3D, elev=[25,10], azip=(-180, 180),X=fixed(X), y=fixed(y));

We can see that with this additional dimension, the data becomes trivially linearly separable, by drawing a separating plane at, say, r=0.7.

we would like to somehow automatically find the best basis functions to use.

One strategy to this end is to compute a basis function centered at every point in the dataset, and let the SVM algorithm sift through the results. This type of basis function transformation is known as a kernel transformation, as it is based on a similarity relationship (or kernel) between each pair of points.

In [42]:
clf = SVC(kernel='rbf', C=1E6)
clf.fit(X, y)
Out[42]:
SVC(C=1000000.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
In [71]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=300, lw=1, facecolors='none');

Tuning the SVM: Softening Margins

In [72]:
# where data is blended
X, y = make_blobs(n_samples=100, centers=2,random_state=0, cluster_std=1.2)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');

To handle this case, the SVM implementation has a bit of a fudge-factor which "softens" the margin: that is, it allows some of the points to creep into the margin if that allows a better fit.

For very large $C$, the margin is hard, and points cannot lie in it. For smaller $C$, the margin is softer, and can grow to encompass some points.

In [73]:
# Visual picture of how a changing $C$ parameter affects the final fit
X, y = make_blobs(n_samples=100, centers=2,random_state=0, cluster_std=0.8)

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

for axi, C in zip(ax, [10.0, 0.1]):
    model = SVC(kernel='linear', C=C).fit(X, y)
    axi.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    plot_svc_decision_function(model, axi)
    axi.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=300, lw=1, facecolors='none');
    axi.set_title('C = {0:.1f}'.format(C), size=14)

Example: Face Recognition

take a look at the facial recognition problem

In [40]:
# import data

from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)
['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 62, 47)
In [41]:
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])
In [42]:
# use a PCA to extract 150 fundamental components to feed into our support vector machine classifier.
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

pca = PCA(n_components=150, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)
In [43]:
# split data to train and test
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target,random_state=42)
In [44]:
#  use a grid search cross-validation to explore combinations of parameters
from sklearn.model_selection import GridSearchCV
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)

%time grid.fit(Xtrain, ytrain)
print(grid.best_params_)
Wall time: 31.7 s
{'svc__C': 10, 'svc__gamma': 0.001}
In [45]:
model = grid.best_estimator_
yfit = model.predict(Xtest)
In [46]:
# see predicted image
fig, ax = plt.subplots(4, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(Xtest[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    axi.set_ylabel(faces.target_names[yfit[i]].split()[-1],
                   color='black' if yfit[i] == ytest[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=14);
In [86]:
# see classification report
from sklearn.metrics import classification_report
print(classification_report(ytest, yfit,
                            target_names=faces.target_names))
                   precision    recall  f1-score   support

     Ariel Sharon       0.65      0.73      0.69        15
     Colin Powell       0.80      0.87      0.83        68
  Donald Rumsfeld       0.74      0.84      0.79        31
    George W Bush       0.92      0.83      0.88       126
Gerhard Schroeder       0.86      0.83      0.84        23
      Hugo Chavez       0.93      0.70      0.80        20
Junichiro Koizumi       0.92      1.00      0.96        12
       Tony Blair       0.85      0.95      0.90        42

         accuracy                           0.85       337
        macro avg       0.83      0.84      0.84       337
     weighted avg       0.86      0.85      0.85       337

In [47]:
#This helps us get a sense of which labels are likely to be confused by the estimator.
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=faces.target_names,
            yticklabels=faces.target_names)

plt.xlabel('true label')
plt.ylabel('predicted label');

Support Vector Machine Summary

We have seen here a brief intuitive introduction to the principals behind support vector machines. These methods are a powerful classification method for a number of reasons:

  1. Their dependence on relatively few support vectors means that they are very compact models, and take up very little memory.
  2. Once the model is trained, the prediction phase is very fast.
  3. Because they are affected only by points near the margin, they work well with high-dimensional data—even data with more dimensions than samples, which is a challenging regime for other algorithms.
  4. Their integration with kernel methods makes them very versatile, able to adapt to many types of data.

However, SVMs have several disadvantages as well

  1. The scaling with the number of samples $N$ is $O[N^3]$ at worst, or $O[N^2]$ for efficient implementations. For large numbers of training samples, this computational cost can be prohibitive.
  2. The results are strongly dependent on a suitable choice for the softening parameter C. This must be carefully chosen via cross-validation, which can be expensive as datasets grow in size.
  3. The results do not have a direct probabilistic interpretation. This can be estimated via an internal cross-validation (see the probability parameter of SVC), but this extra estimation is costly.

Decision Trees and Random Forests

Decision Trees

The binary splitting makes this extremely efficient: in a well-constructed tree, each question will cut the number of options by approximately half

each node in the tree splits the data into two groups using a cutoff value within one of the features

In [112]:
# sample data with four class
from sklearn.datasets import make_blobs
plt.style.use('dark_background')

X, y = make_blobs(n_samples=300, centers=4,random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');
In [113]:
# Scikit-Learn with the DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier().fit(X, y)
In [114]:
# help visualize
def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),zorder=1)

    ax.set(xlim=xlim, ylim=ylim)
In [115]:
visualize_classifier(DecisionTreeClassifier(), X, y)
C:\Users\PK\Anaconda3\lib\site-packages\ipykernel_launcher.py:24: UserWarning: The following kwargs were not used by contour: 'clim'

Decision trees and over-fitting

it is very easy to go too deep in the tree, and thus to fit details of the particular data rather than the overall properties of the distributions

Each Tree models are inconsistent so combine information from many trees = better result!

Ensembles of Estimators: Random Forests

multiple overfitting estimators can be combined to reduce the effect of this overfitting = Bagging

In [117]:
#bagging classification can be done manually using Scikit-Learn's BaggingClassifier meta-estimatorr
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier()
bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8,random_state=1) # 100 trees

bag.fit(X, y)
visualize_classifier(bag, X, y)
C:\Users\PK\Anaconda3\lib\site-packages\ipykernel_launcher.py:24: UserWarning: The following kwargs were not used by contour: 'clim'

In Scikit-Learn, such an optimized ensemble of randomized decision trees is implemented in the RandomForestClassifier estimator,
which takes care of all the randomization automatically. All you need to do is select a number of estimators

In [122]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10000, random_state=0)
visualize_classifier(model, X, y);
C:\Users\PK\Anaconda3\lib\site-packages\ipykernel_launcher.py:24: UserWarning: The following kwargs were not used by contour: 'clim'

Random Forest Regression

In [123]:
# Consider the following data, drawn from the combination of a fast and slow oscillation:
rng = np.random.RandomState(42)
x = 10 * rng.rand(200)

def model(x, sigma=0.3):
    fast_oscillation = np.sin(5 * x)
    slow_oscillation = np.sin(0.5 * x)
    noise = sigma * rng.randn(len(x))

    return slow_oscillation + fast_oscillation + noise

y = model(x)
plt.errorbar(x, y, 0.3, fmt='o');
In [124]:
# Using the random forest regressor, we can find the best fit curve as follows:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(200)
forest.fit(x[:, None], y)

xfit = np.linspace(0, 10, 1000)
yfit = forest.predict(xfit[:, None])
ytrue = model(xfit, sigma=0)

plt.errorbar(x, y, 0.3, fmt='o', alpha=0.5)
plt.plot(xfit, yfit, '-r');
plt.plot(xfit, ytrue, '-k', alpha=0.5);

Example: Random Forest for Classifying Digits

In [48]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
Out[48]:
dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
In [49]:
# Visualize data
# set up the figure
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
In [53]:
# We can quickly classify the digits using a random forest as follows:
from sklearn.model_selection import train_test_split
Xtrain , Xtest, ytrain , ytest = train_test_split(digits.data, digits.target, random_state=0)

model =RandomForestClassifier(n_estimators=1000)
model.fit(Xtrain,ytrain)
ypred = model.predict(Xtest)
In [54]:
from sklearn import metrics
print(metrics.classification_report(ypred, ytest))
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        38
           1       0.98      0.95      0.97        44
           2       0.95      1.00      0.98        42
           3       0.98      1.00      0.99        44
           4       0.97      1.00      0.99        37
           5       0.98      0.96      0.97        49
           6       1.00      1.00      1.00        52
           7       1.00      0.96      0.98        50
           8       0.96      0.98      0.97        47
           9       0.98      0.98      0.98        47

    accuracy                           0.98       450
   macro avg       0.98      0.98      0.98       450
weighted avg       0.98      0.98      0.98       450

In [55]:
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
# We find that a simple, untuned random forest results in a very accurate classification of the digits data.

Summary of Random Forests

Random forests are a powerful method with several advantages:

  1. Both training and prediction are very fast, because of the simplicity of the underlying decision trees. In addition, both tasks can be straightforwardly parallelized, because the individual trees are entirely independent entities.
  2. The multiple trees allow for a probabilistic classification: a majority vote among estimators gives an estimate of the probability (accessed in Scikit-Learn with the predict_proba() method).
  3. The nonparametric model is extremely flexible, and can thus perform well on tasks that are underfit by other estimators.

Principal Component Analysis

PCA is a dimensionality reduction algorithm, but also useful tool for visualization, for noise filtering, for feature extraction and engineering

In [35]:
plt.style.use('classic')
#sample some data
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2,2) , rng.randn(2,200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');

Unsupervised learning attempts to learn about the relationship between the x and y values.
In PCA, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.

In [18]:
from sklearn.decomposition import  PCA
pca = PCA(n_components=2)
pca.fit(X)
Out[18]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [19]:
print(pca.components_)
[[-0.94446029 -0.32862557]
 [-0.32862557  0.94446029]]
In [20]:
print(pca.explained_variance_)
[0.7625315 0.0184779]
In [36]:
#visualize using the components as direction vector and explained variance to define square-length of the vector
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');

These vectors represent the principal axes of the data, and the length of the vector is an indication of how "important" that axis is in describing the distribution of the data

PCA as dimensionality reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.

In [188]:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)
original shape:    (200, 2)
transformed shape: (200, 1)

The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:

In [189]:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal');

The light points are the original data, while the dark points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance

PCA for visualization: Hand-written digits

The usefulness of the dimensionality reduction becomes much more clear when looking at high-dimensional data

In [190]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
Out[190]:
(1797, 64)
In [191]:
pca = PCA(2)  # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)
(1797, 64)
(1797, 2)
In [201]:
# We can now plot the first two principal components of each point to learn about the data:
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('Spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();

PCA can be thought of as a process of choosing optimal basis functions, such that adding together just the first few of them is enough to suitably reconstruct the bulk of the elements in the dataset.

Choosing the number of components

This can be determined by looking at the cumulative explained variance ratio as a function of the number of components:

In [202]:
pca = PCA().fit(digits.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

the digits the first 10 components contain approximately 75% of the variance, while you need around 50 components to describe close to 100% of the variance.

PCA as Noise Filtering

The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise.

So if you reconstruct the data using just the largest subset of principal components, you should be preferentially keeping the signal and throwing out the noise.

In [203]:
## Plot noise-free data
def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 16))
plot_digits(digits.data)
In [204]:
# Lets add some random noise to create a noisy dataset, and re-plot it:
np.random.seed(42)
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)
In [205]:
pca = PCA(0.50).fit(noisy)
pca.n_components_
#Here 50% of the variance amounts to 12 principal components. 
Out[205]:
12
In [206]:
#compute these components, and then use the inverse of the transform to reconstruct the filtered digits:
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)

This signal preserving/noise filtering property makes PCA a very useful feature selection routine—for example, rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs.

Example: Eigenfaces

Let's take a look at the principal axes that span this dataset.

In [207]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)
['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 62, 47)
In [208]:
pca = PCA(150)
pca.fit(faces.data)
Out[208]:
PCA(copy=True, iterated_power='auto', n_components=150, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [209]:
# plot with first fews components
fig, axes = plt.subplots(3, 8, figsize=(9, 4),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(62, 47), cmap='bone')
In [210]:
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
In [212]:
# Compute the components and projected faces
pca =  PCA(150).fit(faces.data)
components = pca.transform(faces.data)
projected = pca.inverse_transform(components)
In [213]:
# Plot with 150 componenets from 3000 initial features
fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
                       subplot_kw={'xticks':[], 'yticks':[]},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
    ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')
    
ax[0, 0].set_ylabel('full-dim\ninput')
ax[1, 0].set_ylabel('150-dim\nreconstruction');

Principal Component Analysis Summary

In this section we have discussed the use of principal component analysis for dimensionality reduction, for visualization of high-dimensional data, for noise filtering, and for feature selection within high-dimensional data. Because of the versatility and interpretability of PCA, it has been shown to be effective in a wide variety of contexts and disciplines. Given any high-dimensional dataset, I tend to start with PCA in order to visualize the relationship between points (as we did with the digits), to understand the main variance in the data (as we did with the eigenfaces), and to understand the intrinsic dimensionality (by plotting the explained variance ratio). Certainly PCA is not useful for every high-dimensional dataset, but it offers a straightforward and efficient path to gaining insight into high-dimensional data.

PCA's main weakness is that it tends to be highly affected by outliers in the data. For this reason, many robust variants of PCA have been developed, many of which act to iteratively discard data points that are poorly described by the initial components. Scikit-Learn contains a couple interesting variants on PCA, including RandomizedPCA and SparsePCA, both also in the sklearn.decomposition submodule. RandomizedPCA, which we saw earlier, uses a non-deterministic method to quickly approximate the first few principal components in very high-dimensional data, while SparsePCA introduces a regularization term (see In Depth: Linear Regression) that serves to enforce sparsity of the components.

In the following sections, we will look at other unsupervised learning methods that build on some of the ideas of PCA.

Manifold Learning : reduce data dimension suited with non-linear model

PCA can be used in the dimensionality reduction task—reducing the number of features of a dataset while maintaining the essential relationships between the points.
While PCA is flexible, fast, and easily interpretable, it does not perform so well when there are nonlinear relationships within the data

Manifold seeks to describe datasets as low-dimensional manifolds embedded in high-dimensional spaces by seek to learn about the fundamental two-dimensional nature of the paper

Multidimensional Scaling (MDS)

In [214]:
# sample data shape of HELLO
def make_hello(N=1000, rseed=42):
    # Make a plot with "HELLO" text; save as PNG
    fig, ax = plt.subplots(figsize=(4, 1))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
    ax.axis('off')
    ax.text(0.5, 0.4, 'HELLO', va='center', ha='center', weight='bold', size=85)
    fig.savefig('hello.png')
    plt.close(fig)
    
    # Open this PNG and draw random points from it
    from matplotlib.image import imread
    data = imread('hello.png')[::-1, :, 0].T
    rng = np.random.RandomState(rseed)
    X = rng.rand(4 * N, 2)
    i, j = (X * data.shape).astype(int).T
    mask = (data[i, j] < 1)
    X = X[mask]
    X[:, 0] *= (data.shape[0] / data.shape[1])
    X = X[:N]
    return X[np.argsort(X[:, 0])]
In [215]:
X = make_hello(1000)
colorize = dict(c=X[:, 0], cmap=plt.cm.get_cmap('rainbow', 5))
plt.scatter(X[:, 0], X[:, 1], **colorize)
plt.axis('equal');
In [216]:
# we can scale, shrink, or rotate the data, and the "HELLO" will still be apparent.
def rotate(X, angle):
    theta = np.deg2rad(angle)
    R = [[np.cos(theta), np.sin(theta)],
         [-np.sin(theta), np.cos(theta)]]
    return np.dot(X, R)
    
X2 = rotate(X, 20) + 5
plt.scatter(X2[:, 0], X2[:, 1], **colorize)
plt.axis('equal');

This tells us that the x and y values are not necessarily fundamental to the relationships in the data. What is fundamental, in this case, is the distance between each point and the other points in the dataset

In [218]:
# Let's use Scikit-Learn's efficient pairwise_distances function to do this for our original data:
from sklearn.metrics import pairwise_distances
D = pairwise_distances(X)
print(D.shape)
plt.imshow(D, zorder=2, cmap='Blues', interpolation='nearest')
plt.colorbar();
(1000, 1000)

MDS as Manifold Learning

The usefulness of this becomes more apparent when we consider the fact that distance matrices can be computed from data in any dimension.

In [219]:
def random_projection(X, dimension=3, rseed=42):
    assert dimension >= X.shape[1]
    rng = np.random.RandomState(rseed)
    C = rng.randn(dimension, dimension)
    e, V = np.linalg.eigh(np.dot(C, C.T))
    return np.dot(X, V[:X.shape[1]])
    
X3 = random_projection(X, 3)
X3.shape
Out[219]:
(1000, 3)
In [220]:
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
ax.scatter3D(X3[:, 0], X3[:, 1], X3[:, 2],
             **colorize)
ax.view_init(azim=70, elev=50)
In [222]:
# sk the MDS estimator to input this three-dimensional data, compute the distance matrix, and then determine the optimal two-dimensional embedding for this distance matrix.
from sklearn.manifold import MDS

model = MDS(n_components=2, random_state=1)
out3 = model.fit_transform(X3)
plt.scatter(out3[:, 0], out3[:, 1], **colorize)
plt.axis('equal');

This is essentially the goal of a manifold learning estimator: given high-dimensional embedded data, it seeks a low-dimensional representation of the data that preserves certain relationships within the data

The only clear advantage of manifold learning methods over PCA is their ability to preserve nonlinear relationships in the data; for that reason I tend to explore data with manifold methods only after first exploring them with PCA.

K-Means Clustering

Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.
K-means clustering, which is implemented in sklearn.cluster.KMeans

The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:

  • The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
  • Each point is closer to its own cluster center than to other cluster centers.
In [5]:
# First, let's generate a two-dimensional dataset containing four distinct blobs.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
In [6]:
# The k-means algorithm does this automatically, 
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
In [7]:
#Let's visualize the results by plotting the data colored by these labels and  the cluster centers
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
#approach to k-means involves an intuitive iterative approach known as expectation–maximization.

K-Means Algorithm: Expectation–Maximization

the expectation–maximization approach here consists of the following procedure:

  1. Guess some cluster centers
  2. Repeat until converged
    • Expectation Step: assign points to the nearest cluster center
    • Maximization Step: set the cluster centers to the mean

Each repetition of the E-step and M-step will always result in a better estimate of the cluster characteristics.
Although E-M approach might not converged to optimal point, scikit learn do random assign first point multiple times to ensure we get to the opimal central point

In [8]:
from sklearn.datasets.samples_generator import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)

rng = np.random.RandomState(42)
centers = [0, 4] + rng.randn(4, 2)

def draw_points(ax, c, factor=1):
    ax.scatter(X[:, 0], X[:, 1], c=c, cmap='viridis',
               s=50 * factor, alpha=0.3)
    
def draw_centers(ax, centers, factor=1, alpha=1.0):
    ax.scatter(centers[:, 0], centers[:, 1],
               c=np.arange(4), cmap='viridis', s=200 * factor,
               alpha=alpha)
    ax.scatter(centers[:, 0], centers[:, 1],
               c='black', s=50 * factor, alpha=alpha)

def make_ax(fig, gs):
    ax = fig.add_subplot(gs)
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    return ax

fig = plt.figure(figsize=(15, 4))
gs = plt.GridSpec(4, 15, left=0.02, right=0.98, bottom=0.05, top=0.95, wspace=0.2, hspace=0.2)
ax0 = make_ax(fig, gs[:4, :4])
ax0.text(0.98, 0.98, "Random Initialization", transform=ax0.transAxes,
         ha='right', va='top', size=16)
draw_points(ax0, 'gray', factor=2)
draw_centers(ax0, centers, factor=2)

for i in range(3):
    ax1 = make_ax(fig, gs[:2, 4 + 2 * i:6 + 2 * i])
    ax2 = make_ax(fig, gs[2:, 5 + 2 * i:7 + 2 * i])
    
    # E-step
    y_pred = pairwise_distances_argmin(X, centers)
    draw_points(ax1, y_pred)
    draw_centers(ax1, centers)
    
    # M-step
    new_centers = np.array([X[y_pred == i].mean(0) for i in range(4)])
    draw_points(ax2, y_pred)
    draw_centers(ax2, centers, alpha=0.3)
    draw_centers(ax2, new_centers)
    for i in range(4):
        ax2.annotate('', new_centers[i], centers[i],
                     arrowprops=dict(arrowstyle='->', linewidth=1))
        
    
    # Finish iteration
    centers = new_centers
    ax1.text(0.95, 0.95, "E-Step", transform=ax1.transAxes, ha='right', va='top', size=14)
    ax2.text(0.95, 0.95, "M-Step", transform=ax2.transAxes, ha='right', va='top', size=14)


# Final E-step    
y_pred = pairwise_distances_argmin(X, centers)
axf = make_ax(fig, gs[:4, -4:])
draw_points(axf, y_pred, factor=2)
draw_centers(axf, centers, factor=2)
axf.text(0.98, 0.98, "Final Clustering", transform=axf.transAxes,
         ha='right', va='top', size=16)
Out[8]:
Text(0.98, 0.98, 'Final Clustering')

The number of clusters must be selected beforehand

Some time its difficult to answer if you dont have pre knowledge about the subject.

Alternatively, you might use a more complicated clustering algorithm which has a better quantitative measure of the fitness per number of clusters (e.g., Gaussian mixture models) or which can choose a suitable number of clusters (e.g., DBSCAN, mean-shift, or affinity propagation, all available in the sklearn.cluster submodule)

k-means is limited to linear cluster boundaries

The fundamental model assumptions of k-means (points will be closer to their own cluster center than to others) means that the algorithm will often be ineffective if the clusters have complicated geometries.

In [9]:
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=0)

labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');

This situation is reminiscent of the discussion in In-Depth: Support Vector Machines, where we used a kernel transformation to project the data into a higher dimension where a linear separation is possible. We might imagine using the same trick to allow k-means to discover non-linear boundaries.

One version of this kernelized k-means is implemented in Scikit-Learn within the SpectralClustering estimator. It uses the graph of nearest neighbors to compute a higher-dimensional representation of the data, and then assigns labels using a k-means algorithm:

In [10]:
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');
C:\Users\PK\Anaconda3\lib\site-packages\sklearn\manifold\_spectral_embedding.py:236: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn("Graph is not fully connected, spectral embedding"

k-means can be slow for large numbers of samples

Because each iteration of k-means must access every point in the dataset, the algorithm can be relatively slow as the number of samples grows

Alternatively we can use a subset of the data to update the cluster centers at each step. This is the idea behind batch-based k-means algorithms, one form of which is implemented in sklearn.cluster.MiniBatchKMeans. The interface for this is the same as for standard KMeans; we will see an example of its use as we continue our discussion.

Example 1: k-means on digits

attempt to use k-means to try to identify similar digits without using the original label information

In [56]:
# laod digit data
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
Out[56]:
(1797, 64)
In [62]:
from sklearn.cluster import KMeans
# we know that number has 10
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
Out[62]:
(10, 64)
In [63]:
# Let's see what these cluster centers look like:
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

We see that even without the labels, KMeans is able to find clusters whose centers are recognizable digits, with perhaps the exception of 1 and 8.

In [64]:
# check how accurate our unsupervised clustering was in finding similar digits within the data:
# by matching each learned cluster label with the true labels found in them:
from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
    
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)
Out[64]:
0.7952142459654981
In [65]:
# Let's check the confusion matrix for this:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
# We see that the main point of confusion is between the eights and ones.
In [66]:
#  let's try to push this even farther. We can use the t-distributed stochastic neighbor embedding (t-SNE) algorithm (mentioned in In-Depth: Manifold Learning) to pre-process the data before performing k-means. 
from sklearn.manifold import TSNE

# Project the data: this step will take several seconds
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

# Compute the clusters
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)

# Permute the labels
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

# Compute the accuracy
accuracy_score(digits.target, labels)
# That's nearly 92% classification accuracy without using the labels.
Out[66]:
0.9371174179187535

Gaussian Mixture Models

Inspire by weaknesses of k-means that it use of simple distance from cluster can lead to poor performance and do not have intrinsic measure of probability of cluster assignment, So it suited with circle data shape but not oblong or elliptical and so on. These two disadvantages of k-means—its lack of flexibility in cluster shape and lack of probabilistic cluster assignment

GMM(generating new data ) attempts to find a mixture of multi-dimensional Gaussian probability distributions that best model any input dataset.

In [1]:
# standard import
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns ; sns.set()
In [123]:
#Load sample data
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.60, random_state=0)
X = X[:, ::-1] # flip axes for better plotting
In [124]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4).fit(X)
#plot
labels = gmm.predict(X)
plt.scatter(X[:,0] , X[:,1], c=labels,s = 40, cmap='tab20b_r');

GMM contains a probabilistic model under the hood, it is also possible to find probabilistic cluster assignments

In [125]:
# return probability of assign to cluster
probs = gmm.predict_proba(X)
probs[:5].round(2)
Out[125]:
array([[0.  , 0.53, 0.  , 0.47],
       [0.  , 0.  , 1.  , 0.  ],
       [0.  , 0.  , 1.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  ],
       [0.  , 0.  , 1.  , 0.  ]])
In [126]:
# we can use the size of the cluster to reflect the probability of assignment (small = not sure = low prob)
size = 50 * probs.max(1)**2 # .max() get only the highest value which is the one that got assigned
plt.scatter(X[:,0],X[:,1], s = size , cmap='tab20b_r', c= labels);

Under the hood, a Gaussian mixture model is very similar to k-means: it uses an expectation–maximization approach which qualitatively does the following:

  1. Choose starting guesses for the location and shape

  2. Repeat until converged:

    1. E-step: for each point, find weights encoding the probability of membership in each cluster
    2. M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights

The result of this is that each cluster is associated not with a hard-edged sphere, but with a smooth Gaussian model

In [127]:
#  create a function that visualize the locations and shapes of the GMM clusters by drawing ellipses based on the GMM output
from matplotlib.patches import Ellipse

def draw_ellipse(position, covariance, ax=None, **kwargs):
    """Draw an ellipse with a given position and covariance"""
    ax = ax or plt.gca()
    
    # Convert covariance to principal axes
    if covariance.shape == (2, 2):
        U, s, Vt = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        angle = 0
        width, height = 2 * np.sqrt(covariance)
    
    # Draw the Ellipse
    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                             angle, **kwargs))
        
def plot_gmm(gmm, X, label=True, ax=None):
    ax = ax or plt.gca()
    labels = gmm.fit(X).predict(X)
    if label:
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    
    w_factor = 0.2 / gmm.weights_.max()
    for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
        draw_ellipse(pos, covar, alpha=w * w_factor)
In [128]:
# see the shape of cluster
gmm = GaussianMixture(n_components=4)
plot_gmm(gmm, X)
In [131]:
# try with ellip shape data
rng = np.random.RandomState(13)
X_stretched = np.dot(X, rng.randn(2, 2))

gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
plot_gmm(gmm, X_stretched)

Choosing the covariance type

This hyperparameter controls the degrees of freedom in the shape of each cluster3
From less complex to more complex (computation expensive)

  1. covariance_type="spherical" : cycle (similar to kmean)
  2. covariance_type="diag" : ellipse (default param)
  3. covariance_type="full" : grow along data

GMM as Density Estimation

The result of a GMM fit to some data is technically not a clustering model, but a generative probabilistic model describing the distribution of the data.

In [147]:
from sklearn.datasets import make_moons
Xmoon , ymoon = make_moons(n_samples = 400,noise = 0.05)
plt.scatter(Xmoon[:,0],Xmoon[:,1], s = 20);
In [154]:
# Not useful as a clustering 
gmm2 = GaussianMixture(n_components=2, covariance_type='full').fit(Xmoon)
labels = gmm2.predict(Xmoon)
plot_gmm(gmm2,Xmoon)
In [195]:
# we can use GMM as a generative model
gmm16 = GaussianMixture(n_components=16, covariance_type='full', random_state=0)
plot_gmm(gmm16, Xmoon, label=False)

This is mixture of 16 Gaussians serves not to find separated clusters of data, but rather to model the overall distribution of the input data. GMM gives us the recipe to generate new random data distributed similarly to our input.
For example, here are 400 new points drawn from this 16-component GMM fit to our original data:

In [196]:
Xnew = list(gmm16.sample(400))
In [197]:
# create random sample with same shape 
plt.scatter(Xnew[0][:, 0], Xnew[0][:, 1]); 

Detemining number of components(cluster) in parameter

we can simply evaluate the likelihood of the data under the model, using cross-validation to avoid over-fitting.
Another means of correcting for over-fitting is use Akaike information criterion (AIC) or the Bayesian information criterion (BIC)

In [200]:
# lets look at AIC and BIC of make_moon dataset
number_of_compenents = np.arange(1,21) 
#create model with list comprehension to loop n values
models = [GaussianMixture(n_components= n,covariance_type='full', random_state=0).fit(Xmoon) for n in number_of_compenents]
#plot
plt.plot(number_of_compenents, [m.bic(Xmoon) for m in models], label='BIC')
plt.plot(number_of_compenents, [m.aic(Xmoon) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('n_components');

The optimal number of clusters is the value that minimizes the AIC or BIC. AIC and BIC recommend 10 cluster.
Notice the important point: this choice of number of components measures how well GMM works as a density estimator, not how well it works as a clustering

Example GMM : Generating New data

Using GMM generate new handwritten digits

In [208]:
#import data
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
Out[208]:
(1797, 64)
In [212]:
# create function to plot digit
def plot_digits(data):
    fig, ax = plt.subplots(10, 10, figsize=(8, 8),
                           subplot_kw=dict(xticks=[], yticks=[]))
    fig.subplots_adjust(hspace=0.05, wspace=0.05)
    for i, axi in enumerate(ax.flat):
        im = axi.imshow(data[i].reshape(8, 8), cmap='binary')
        im.set_clim(0, 16)
plot_digits(digits.data)

GMMs can have difficulty converging in such a high dimensional space, so we will start with an invertible dimensionality reduction algorithm on the data. Here we will use a straightforward PCA, asking it to preserve 99% of the variance in the projected data:

In [234]:
# PCA before GMM
from sklearn.decomposition import PCA
pca = PCA(0.99, whiten=True)
data = pca.fit_transform(digits.data)
#plot
data.shape
Out[234]:
(1797, 41)

We can reduce from 64 to 41 dimension with almost no info lost.
lets use AIC to see how many GMM component we should use.

In [237]:
n_components = np.arange(50, 210, 10)
models = [GaussianMixture(n, covariance_type='full')
          for n in n_components]

aics = [model.fit(data).aic(data) for model in models]
plt.plot(n_components, aics);

It appears that around 140 components minimizes the AIC.
lets try 140 with GMM

In [243]:
gmm = GaussianMixture(n_components=140)
gmm.fit(data)
print(gmm.converged_)
True
In [251]:
data_new = gmm.sample(100)[0]
data_new.shape
Out[251]:
(100, 41)

Finally, we can use the inverse transform of the PCA object to construct the new digits:

In [252]:
digits_new = pca.inverse_transform(data_new)
plot_digits(digits_new)

Kernel Density Estimation

GMM is kinda hybrid between cluster estimator and density estimator.
Recall that a density estimator is an algorithm which takes a D-dimensional dataset and produces an estimate of the D-dimensional probability distribution which that data is drawn from. The GMM algorithm accomplishes this by representing the density as a weighted sum of Gaussian distributions.
Kernel density estimation uses a mixture consisting of one Gaussian component per point, resulting in an essentially non-parametric estimator of density

KDE: Histograms

The free parameters of kernel density estimation are the kernel, which specifies the shape of the distribution placed at each point, and the kernel bandwidth, which controls the size of the kernel at each point

In [266]:
# create some data that is drawn from two normal distributions:
def make_data(N, f=0.3, rseed=1):
    rand = np.random.RandomState(rseed)
    x = rand.randn(N)
    x[int(f * N):] += 5
    return x

x = make_data(1000)
x_d = np.linspace(-4, 8, 1000)
In [276]:
from sklearn.neighbors import KernelDensity

# instantiate and fit the KDE model
kde = KernelDensity(bandwidth=0.35, kernel='gaussian')
kde.fit(x[:, None])

# score_samples returns the log of the probability density
logprob = kde.score_samples(x_d[:, None])

plt.fill_between(x_d, np.exp(logprob), alpha=0.5)
plt.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
plt.ylim(-0.02, 0.22)
Out[276]:
(-0.02, 0.22)

Selecting the bandwidth via cross-validation

Here we will use GridSearchCV to optimize the bandwidth for the preceding dataset. Because we are looking at such a small dataset, we will use leave-one-out cross-validation, which minimizes the reduction in training set size for each cross-validation trial:

In [274]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut

bandwidths = 10 ** np.linspace(-1, 1, 100)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': bandwidths},
                    cv=LeaveOneOut())
grid.fit(x[:, None]);
In [275]:
# the choice of bandwidth which maximizes the score (which in this case defaults to the log-likelihood)
grid.best_params_
Out[275]:
{'bandwidth': 0.35111917342151316}